Abstract

We present a linear-space data structure which enables very fast (usually constant time) 
answers to several types of internal queries — questions about factors (also called substrings) of 
a text. A factor-in-factor occurrence query asks for a representation of the set of all occurrences 
of one factor x in another factor y of the same text v of length n. We assume that $|y| = O(|x|)$;
in this case the representation consists of a constant number of arithmetic progressions. This
problem can be viewed as an internal version of the well-studied pattern matching problem. Our
data structure is optimal: it has linear size, constant query time, and linear construction
time. Using the solution to the factor-in-factor problem, we obtain very efficient data
structures answering queries about: primitivity of factors, periods of factors, general substring 
compression, and cyclic equivalence of two factors. All these results improve upon the best 
previously known counterparts. Using our data structure for the period queries, we also provide 
the best known solutions for the recently introduced factor suffix selection queries and for finding 
$\delta$-subrepetitions in a text (a more general version of maximal repetitions, also called runs). With
the latter improvement we obtain the first linear time algorithm finding $\delta$-subrepetitions for a
fixed $\delta$, which matches the linear time complexity of the algorithm computing runs. We benefit
here from the linear construction time of our data structure. 

The model of internal queries in texts is connected to the well-studied problem of text
indexing. Both models have their origins in the introduction of suffix trees. However, there is
an important difference: in our model the size of the representation of a query is constant,
which enables faster query time. Our results can be viewed as efficient solutions to “internal”
equivalents of several basic problems of regular pattern matching, and they improve upon
a majority of the related published results. We introduce several novel techniques
extending the method of pattern matching by sampling. We apply probabilistic tools related to
range minima in random permutations. The construction algorithms of our data structures are
randomized but the queries are deterministic.
1 Introduction


There are many algorithmic problems concerning factors (substrings) in a word. In these problems
we need to construct a data structure which efficiently answers queries specified by factors of a
given word. This constitutes a growing field in the area of text processing. Its origins lie in
the invention of suffix trees, which can be used to answer the most basic types of internal queries:
equality of factors and longest common prefix queries, with constant query time and linear space.

One of the first studies in this area, on a family of problems of compressibility of factors, was 
given by Cormode and Muthukrishnan (SODA’05) [6]; some of these results were later improved 
in [21]. Other typical problems include: range longest common prefix queries (range LCP) [1, 32], 
periodicity [22, 10], minimal/maximal suffixes [3], etc. 

A related model is text indexing, in which one desires to preprocess a given word for future 
queries specified by (usually shorter) patterns. In this setting the query time is $\Omega(m)$, where m
is the size of the query pattern. Our model better fits the scenario when a number of data texts 
are stored and we only query for factors of these texts. Indeed, each query can now be specified in 
constant space and therefore o(m) time algorithms answering queries are possible. 

A routine approach to factor-related queries is based on applications of orthogonal range
searching, see [26]. For the current state of knowledge this implies $\Omega(\log\log n)$ query time and, in most
cases, super-linear space. Moreover, the construction time is $\Omega(n\sqrt{\log n})$, most often $\Omega(n \log n)$. We
design tools based on text processing that are better tailored for factor-related queries and allow us to
obtain constant query time with linear space and (expected) linear construction time. We benefit
from the fast construction time when we apply our techniques to problems in a static setting.

We identify one of the basic problems in this new area which we call factor-in-factor occurrence 
queries and show its usefulness. This problem can be viewed as a direct analogue of the well-studied 
pattern matching problem. We also consider a number of problems related to periods of factors. 
Computation of different types of periodicities is one of the central parts of algorithmics on words. 
A similar type of queries (for tiling periodicity) was studied in [20]. A natural extension of testing 
equality of words is the problem of cyclic equivalence, also called conjugacy [27]. We introduce this 
problem in the context of factors and give an optimal solution. As applications of our results we 
obtain faster solutions to a few recently studied problems of pattern matching and text compression. 

We consider linearly sortable alphabets, that is, we assume that $\Sigma$, the set of letters of the given
word v, can be sorted in linear time (e.g. $\Sigma \subseteq \{0, 1, \ldots, n^{O(1)}\}$). A factor of v is a word of the
form v[i] ... v[j]. The factors in each query are represented by the start and end positions of their
occurrence. The results are for the word-RAM model with word size $w = \Theta(\log n)$, with n being the
length of v, and the algorithms are deterministic, unless otherwise stated.

1.1 Previous Work 

We say that a positive integer p is a period of v if there exists a word u of length p and an integer k
such that v is a prefix of $u^k$. A word is called periodic if it has a period at most half of its length.
The following three types of queries were already studied. 

Period Queries 

Given a factor x of v, report all periods of x (represented by disjoint arithmetic progressions). 
2-Period Queries 

Given a factor x of v, decide whether x is periodic and, if so, compute its shortest period. 


Bounded Longest Common Prefix Queries 

Given two factors x and y of v, find the longest prefix p of x which is a factor of y. 

Known efficient algorithms for these types of queries apply orthogonal range searching queries 
in 2-dimensional rank space (i.e., coordinates of points are in the range [1, n], where n is the 
number of points, query rectangles are orthogonal). Three types of such queries were used to answer 
aforementioned queries: the range-emptiness queries (rempt), which ask if a query rectangle contains
any of the given points, range successor queries (rsucc), which ask for the smallest y-coordinate of a
point in the rectangle, and range searching for minimum queries (rmin), where points are additionally
equipped with weights and we ask for the minimum-weight point in the rectangle. Note that each
of these problems generalizes the previous ones. 

The Period Queries problem was introduced and first studied in [22]. The solutions have
$O(\log n)$ query time with $O(n \log n)$ space, and $O(Q_{rsucc} \log n)$ query time with $O(n + S_{rsucc})$
space. Currently the best trade-offs for the range successor queries are: $Q_{rsucc} = O(\log^{\varepsilon} n)$ for
$S_{rsucc} = O(n)$ [31], $Q_{rsucc} = O(\log\log n)$ for $S_{rsucc} = O(n \log\log n)$ [33], and $Q_{rsucc} = O(1)$ for
$S_{rsucc} = O(n^{1+\varepsilon})$ [11]. Period Queries, in spite of their very recent introduction, have already
found applications. In [2] the authors use them to design factor suffix selection queries. In [23] they
are used to compute all subrepetitions in a word, a notion extending the notion of a run in a word.

The 2-Period Queries problem is a special case of the Period Queries problem briefly
introduced in [10], with a solution in $O(Q_{rmin})$ query time and $O(n + S_{rmin})$ space. Here $Q_{rmin} =
O(\log n)$ for $S_{rmin} = O(n)$ [28, 21], $Q_{rmin} = O(\log\log n)$ for $S_{rmin} = O(n \log^{\varepsilon} n)$ [4], and $Q_{rmin} =
O(1)$ for $S_{rmin} = O(n^{1+\varepsilon})$ [11]. However, the solution to general Period Queries presented in [22]
implies slightly better trade-offs: $O(Q_{rsucc})$ query time with $O(n + S_{rsucc})$ space, and $O(1)$ query
time with $O(n \log n)$ space.

Bounded Longest Common Prefix Queries were introduced in [21] as a tool for the following
Generalized Substring Compression Queries: given two factors x and y of v, compute
the part of the LZ77 [34] compression $LZ(y\$, x)$ that corresponds to x, where $\$ \notin \Sigma$. This problem
was introduced in [6] and was also referred to as substring compression with an additional context
substring. In [21] a solution to Bounded Longest Common Prefix Queries with
$O(Q_{rmin} + Q_{rempt} \log |p|)$ query time and $O(n + S_{rempt} + S_{rmin})$ space implied an
$O(C(Q_{rmin} + Q_{rempt} \log \frac{|x|}{C}))$ time algorithm for Generalized Substring Compression Queries.
The following trade-offs are currently known for range-emptiness queries: $Q_{rempt} = O(\log^{\varepsilon} n)$ for
$S_{rempt} = O(n)$ and $Q_{rempt} = O(\log\log n)$ for $S_{rempt} = O(n \log\log n)$ [4], and $Q_{rempt} = O(1)$ for
$S_{rempt} = O(n^{1+\varepsilon})$ [11].

As a by-product of [21], a solution to the following decision version of the internal pattern matching
problem is obtained: a data structure of size $O(n + S_{rempt})$ that, given factors x, y, checks whether x
occurs in y in $O(Q_{rempt})$ time, provided that x is given by its locus in the suffix tree. All occurrences
of x in y can be reported in additional time proportional to the number of these occurrences.

1.2 Our Results 

We introduce queries that find all occurrences of one factor of v in another factor. 
Factor-in-Factor Occurrence Queries 

Given factors x and y of v with $|y| < 2|x|$, report all occurrences of x in y (represented as an
arithmetic progression).

Our main result is the following theorem together with its corollaries. 
Theorem 1. Factor-in-Factor Occurrence Queries can be answered in $O(1)$ time by a data
structure of size $O(n)$, which can be constructed in $O(n)$ expected time.

Remark 1. For Factor-in-Factor Occurrence Queries, the requirement $|y| < 2|x|$ can
be dropped at the cost of increasing the query time to $O(|y|/|x|)$ and allowing several arithmetic
progressions on the output.
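One way to picture the trade-off in Remark 1 is to cover y with overlapping windows, each shorter than 2|x|, and to issue one restricted query per window. The sketch below is only an illustration under that assumption; the text does not spell out the construction. The function names are hypothetical, and `occurrences_short` is a naive stand-in for the constant-time query of Theorem 1.

```python
def occurrences_short(x, y):
    """Naive stand-in for the restricted query; requires len(y) < 2 * len(x)."""
    assert len(y) < 2 * len(x)
    m = len(x)
    return [s for s in range(len(y) - m + 1) if y[s:s + m] == x]


def occurrences_any(x, y):
    """Occurrences of a non-empty x in y of arbitrary length.

    The windows start at 0, |x|, 2|x|, ... and have length at most 2|x| - 1,
    so about |y| / |x| restricted queries are issued and every occurrence
    starts in exactly one window.
    """
    m, found = len(x), []
    for start in range(0, max(len(y) - m + 1, 0), m):
        window = y[start:start + 2 * m - 1]
        found.extend(start + s for s in occurrences_short(x, window))
    return found


print(occurrences_any("aba", "abababaabab"))
```

Each window contributes at most one arithmetic progression, which is consistent with the several progressions mentioned in the remark.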

A number of applications of Factor-in-Factor Occurrence Queries are presented. We
consider 2-Period Queries and Period Queries defined above as well as new types of queries.

Prefix-Suffix Queries 

Given factors x and y of v and a positive integer d, report all prefixes of x of length between d 
and 2d that are also suffixes of y (represented as an arithmetic progression of their lengths). 

A word y is called a cyclic rotation of a word x if $y = x[i+1] \ldots x[n]\, x[1] \ldots x[i]$ for some i.

Cyclic Equivalence Queries 

Given factors x and y of v, decide whether x is a cyclic rotation of y and, if so, report all
corresponding cyclic shift values (represented as an arithmetic progression). 
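To make the query semantics concrete, the following naive sketch lists all shift values i with y = x[i+1..n] x[1..i] and checks that they form an arithmetic progression. It is a quadratic-time illustration of the definition only, not the constant-time method of the paper; both function names are hypothetical.

```python
def cyclic_shifts(x, y):
    """All i in [0, |x|) such that y == x[i:] + x[:i] (naive check)."""
    if len(x) != len(y):
        return []
    return [i for i in range(len(x)) if x[i:] + x[:i] == y]


def is_arithmetic_progression(values):
    """True if all consecutive differences of the list are equal."""
    return len({b - a for a, b in zip(values, values[1:])}) <= 1


shifts = cyclic_shifts("abab", "baba")
print(shifts, is_arithmetic_progression(shifts))
```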

Corollary 1. Using a data structure of $O(n)$ size, which can be constructed in $O(n)$ expected time,
one can answer:

• Prefix-Suffix Queries in $O(1)$ time,

• 2-Period Queries in $O(1)$ time,

• Period Queries in $O(\log |x|)$ time,

• Cyclic Equivalence Queries in $O(1)$ time.

As we already mentioned, [2] and [23] rely on Period Queries. Corollary 1 lets us improve
these results as follows. For the factor suffix selection defined in [2], the query time with $O(n)$ space
improves from $O(\log^{2+\varepsilon} n)$ to $O(\log^2 n)$.

A $\delta$-subrepetition is a generalization of the notion of a run for exponent at least $1 + \delta$, with $\delta \le 1$.
Thus runs are 1-subrepetitions. The best previously known algorithm for finding $\delta$-subrepetitions
[23] worked in $O(n \log n + \frac{n}{\delta^2} \log \frac{1}{\delta})$ time. With our data structure it improves to $O(n + \frac{n}{\delta^2} \log \frac{1}{\delta})$.
In particular, we obtain the first linear time algorithm finding $\delta$-subrepetitions for a fixed $\delta$, which
matches an $O(n)$ time algorithm for computing runs.

Another application of Factor-in-Factor Occurrence Queries is the following.

Corollary 2. Using a data structure of $O(n + S_{rempt} + S_{rsucc})$ size one can answer Bounded
Longest Common Prefix Queries in $O(Q_{rsucc} + Q_{rempt} \log\log |p|)$ time.

The data structure of Corollary 2 yields a solution to Generalized Substring Compression
Queries with $O(C(Q_{rsucc} + Q_{rempt} \log\log \frac{|x|}{C}))$ query time, as compared to
$O(C(Q_{rmin} + Q_{rempt} \log \frac{|x|}{C}))$ time queries of [21]. Here C is the number of phrases reported.

1.3 Our Techniques 

We use a classic approach to pattern matching by sampling in a completely novel way. To search for
occurrences of x in y, both of which are factors of v, we assign to x a sample which is also a factor
of x. Then we find occurrences of this sample in y and check which of them extend to occurrences
of x in y.

The sample has length $2^{\lfloor \log |x| \rfloor - 1}$ (it is a so-called basic factor). It may be either non-periodic,
in which case y contains only a constant number of occurrences of this sample, or periodic, in which
case all its occurrences can be located en masse with the aid of a unique maximal repetition (a run)
induced by the sample in x.

In the non-periodic case we use a probabilistic argument to make sure that the expected number 
of different samples of factors of v is $O(n)$. We identify such samples using a space-efficient variant of
the Dictionary of Basic Factors data structure (which is normally of size $O(n \log n)$). In the periodic
case we precompute the structure of all runs in the word to be able to efficiently find all runs in y 
that are consistent with the run of the sample. This consistency-check is based on combinatorial 
properties of Lyndon words. The space bound of our data structure relies on a known fact that the 
sum of exponents of runs in a word is linear. 

1.4 Organization of the Paper 

We start by recalling basic notions of combinatorics on words in Section 2. Afterwards, in Section 3
we present the main ideas of our data structure and in Section 4 we develop the necessary
probabilistic tools. The main parts of the data structure for Factor-in-Factor Occurrence Queries
are described in Sections 5, 6 and 7. In Section 5 we cover the non-periodic case, although to simplify
the presentation we only describe the case of a square-free text. Then in Section 6 we show the solution
of the periodic case. Section 7 presents a general solution that combines the techniques from the
two previous sections (the results of Section 6 are used as a black box, while the
techniques of Section 5 are applied with a few minor modifications). Full descriptions of the data structures
for the periodic and general cases are given in Sections 8 and 9. Later, in Section 10, we describe a
linear construction algorithm of the whole data structure. We conclude the paper with a detailed
presentation of applications of Factor-in-Factor Occurrence Queries (Section 11).

2 Preliminaries 

Consider a word $v = v[1]v[2] \ldots v[n]$ of length $|v| = n$, where $v[i] \in \Sigma$. For $1 \le i \le j \le n$, a word
$u = v[i] \ldots v[j]$ is called a factor of v. By v[i, j] we denote an occurrence of u at position i, called a
fragment of v. Throughout the paper by [i, j] we denote an integer interval {i, ..., j}.

The following fact specifies a known efficient data structure for comparison of factors of a word. 
It consists of the suffix table with its inverse, LCP table and a data structure for range minimum 
queries on the LCP table, see [7, 14]. 

Fact 1 (Equality Testing). Let v be a word of length n. After $O(n)$ preprocessing, one can test in
$O(1)$ time whether two given fragments are occurrences of the same factor.

A fragment of v of the form $BF_k(i) = v[i, i + 2^k - 1]$ is called a k-basic fragment. By $n_k = n - 2^k + 1$
we denote the number of k-basic fragments of v. A factor that occurs as a k-basic fragment is called
a k-basic factor. By $m_k$ we denote the number of different k-basic factors of v ($m_k \le n_k$).

The dictionary of basic factors (DBF in short), see [7, 14], consists of $\lfloor \log n \rfloor$ layers. The k-th
layer is a table $DBF_k$ such that $DBF_k[i]$ is an identifier of $BF_k(i)$. The identifiers are consecutive
positive integers that satisfy $DBF_k[i] < DBF_k[i']$ if and only if $BF_k(i) < BF_k(i')$.
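The layers of the DBF can be built by doubling: the identifier of a fragment of length 2^(k+1) is the rank of the pair of identifiers of its two halves. The sketch below is a minimal illustration using comparison sorting (so it does not attain the running time of the constructions in [7, 14]), with 0-based positions; the function name `build_dbf` is hypothetical.

```python
def build_dbf(v):
    """Return layers with layers[k][i] = identifier of v[i : i + 2**k] (0-based i).

    Identifiers in each layer are consecutive integers starting at 1 that
    preserve the lexicographic order of the corresponding basic factors.
    """
    n = len(v)
    letters = sorted(set(v))
    layers = [[letters.index(c) + 1 for c in v]]            # layer 0: single letters
    k = 0
    while 2 ** (k + 1) <= n:
        half, prev = 2 ** k, layers[-1]
        # A fragment of length 2 * half is described by the ids of its two halves.
        pairs = [(prev[i], prev[i + half]) for i in range(n - 2 * half + 1)]
        rank = {p: r + 1 for r, p in enumerate(sorted(set(pairs)))}
        layers.append([rank[p] for p in pairs])
        k += 1
    return layers


for k, layer in enumerate(build_dbf("baababaababb")):
    print(k, layer)
```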
We say that a positive integer p is a period of v if $v[i] = v[i + p]$ holds for all $i \in [1, n - p]$. The
shortest period is denoted as per(v). We call v periodic if $2\,per(v) \le |v|$ and primitive if per(v) is
not a proper divisor of |v|.
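These definitions can be checked on small words with the classic border array: p is a period of w exactly when w has a border of length |w| - p. The sketch below is a standard linear-time computation given for illustration; the function names are hypothetical, and the trivial period |w| is included in the output.

```python
def periods(w):
    """All periods of w in increasing order, read off the border array."""
    n = len(w)
    border = [0] * (n + 1)     # border[i] = length of the longest proper border of w[:i]
    border[0] = -1
    t = -1
    for i in range(n):
        while t >= 0 and w[t] != w[i]:
            t = border[t]
        t += 1
        border[i + 1] = t
    result, b = [], border[n]
    while b >= 0:              # walk the chain of borders of the whole word
        result.append(n - b)
        b = border[b]
    return result


def is_periodic(w):
    return 2 * periods(w)[0] <= len(w)


def is_primitive(w):
    p = periods(w)[0]
    return not (p < len(w) and len(w) % p == 0)


print(periods("abaaba"), is_periodic("abaaba"), is_primitive("abab"))
```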

A run (a maximal repetition) in v is a periodic fragment $\alpha = v[i, j]$ which can be extended
neither to the left nor to the right without increasing the shortest period $p = per(\alpha)$, that is,
$v[i-1] \ne v[i+p-1]$ and $v[j-p+1] \ne v[j+1]$, provided that the respective letters exist. We
define the exponent of a run as $exp(\alpha) = |\alpha|/p$. In our algorithms runs are represented together
with their periods.

Example 1. The word v = baababaababb contains three runs with period 1: aa twice, as v[2, 3]
and as v[7, 8], and v[11, 12] = bb; two runs with period 2: v[3, 7] = ababa and v[8, 11] = abab; one
run with period 3: v[5, 10] = abaaba; and one run with period 5: v[1, 11] = baababaabab.
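Example 1 can be reproduced with a brute-force enumeration that, for every candidate period p, scans for maximal fragments with period p and keeps those of length at least 2p whose shortest period is exactly p. This is a naive polynomial-time sketch for checking small examples, not the linear-time algorithm of Fact 2; the function names are hypothetical.

```python
def shortest_period(w):
    n = len(w)
    return next(p for p in range(1, n + 1)
                if all(w[i] == w[i + p] for i in range(n - p)))


def naive_runs(v):
    """Runs of v as triples (i, j, p): 1-based inclusive positions and period p."""
    n, runs = len(v), []
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if v[i] != v[i + p]:
                i += 1
                continue
            j = i
            while j + p < n and v[j] == v[j + p]:
                j += 1
            # v[i : j + p] is a maximal fragment (0-based slice) with period p.
            if j + p - i >= 2 * p and shortest_period(v[i:j + p]) == p:
                runs.append((i + 1, j + p, p))
            i = j + 1
    return sorted(runs)


for i, j, p in naive_runs("baababaababb"):
    print(f"v[{i},{j}] period {p} exponent {(j - i + 1) / p:.2f}")
```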

The structure of runs in a word can be used to represent all repetitions in a word in a compact 
way. The following fact gathers deep combinatorial and algorithmic results concerning runs useful 
throughout the paper. 

Fact 2 ([24, 25, ?, 12, 8]). In a word of length n both the number of runs and the sum of their
exponents are $O(n)$. Moreover, all the runs in a word can be computed in $O(n)$ time.

The following fact states, in particular, that the result of Factor-in-Factor Occurrence
Queries is well-defined. Its proof is given in Section 8.1.

Fact 3. Let x, y be words with $|y| < 2|x|$. Then the set of positions where x occurs in y forms
a single arithmetic progression. Moreover, if there are at least 3 occurrences, the difference of this 
progression is per(x). 
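Fact 3 is easy to observe experimentally. The sketch below lists occurrences naively and converts them to a triple (first position, difference, count); it is an illustration with hypothetical function names and 0-based positions, not part of the data structure.

```python
def occurrences(x, y):
    """0-based starting positions of x in y (naive scan)."""
    m = len(x)
    return [s for s in range(len(y) - m + 1) if y[s:s + m] == x]


def as_progression(positions):
    """Return (first, difference, count), or None if not an arithmetic progression."""
    if len(positions) < 2:
        return (positions[0] if positions else None, 0, len(positions))
    d = positions[1] - positions[0]
    if any(b - a != d for a, b in zip(positions, positions[1:])):
        return None
    return (positions[0], d, len(positions))


# |y| = 10 < 2|x| = 12: three occurrences whose difference is per(x) = 2.
print(as_progression(occurrences("ababab", "ababababab")))
```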

3 Our Approach 

We sketch our approach for answering Factor-in-Factor Occurrence Queries focusing on
the non-periodic case. For simplicity we assume in this section that v is square-free. Recall that a 
word u is called a square if u = ww for a word w, and a word is called square-free if it does not 
contain any squares as factors, see also [27]. 

We call a set $X \subseteq \mathbb{Z}$ a $\Delta$-sparse set if for any distinct elements $a, b \in X$, $|a - b| > \Delta$.

Observation 1 (Sparsity of Occurrences). Let u and v be words and assume v is square-free. Then
the set of positions where u occurs in v is |u|-sparse.

In particular, Factor-in-Factor Occurrence Queries for a square-free word v return at
most one occurrence.

A first approach to solving Factor-in-Factor Occurrence Queries in the square-free case
using the idea of samples could be as follows. For each of the basic factors of v we store a sorted
list of all its occurrences. The sample x' for a query pattern x is its prefix being a $\lfloor \log |x| \rfloor$-basic
factor. Using the precomputed lists we find all occurrences of x' in y; by Observation 1 there are
at most 3 such occurrences. Afterwards standard techniques (Fact 1) let us verify which of them
extend to occurrences of x. This approach requires $O(n \log n)$ space due to the number of possible
samples. To obtain $O(n)$ space, we will perform a more careful selection of samples so that their
total number is linear. In the following definition we slightly change the approach and compute
samples only for basic fragments of v.
Definition 1. Let v be a square-free word. We call $repr_k$ a k-representative assignment for v if
the following conditions are satisfied:

1. $repr_k : [1, n_{k+1}] \to [1, n_k]$, i.e. $repr_k$ assigns to each (k+1)-basic fragment a k-basic fragment,

2. $repr_k(i) \in [i, i + 2^k]$ for each i, i.e. each fragment is assigned one of its subfragments,

3. $repr_k(i) - i = repr_k(i') - i'$ if $BF_{k+1}(i) = BF_{k+1}(i')$, i.e. if two basic fragments are occurrences
of the same factor, they are assigned corresponding subfragments.

The values of a k-representative assignment are called k-representative positions or representative
occurrences of the corresponding k-basic factors. The set of k-representative positions is denoted as
$Repr_k$. We say that a k-basic fragment is representative if it starts at a k-representative position.

In the data structure we store only the representative occurrences of all k-basic factors. Note
that for a fixed basic factor this set might be empty or contain some occurrences, but not necessarily
all of them. Thus property (3) is crucial for the correctness of our approach to queries.

Let $S_m$ be the set of all permutations of {1, ..., m}. For a permutation $\pi_k \in S_{m_k}$ we set

$repr_k(i) = \operatorname{argmin} \{ID_k[j] : j \in [i, i + 2^k]\}$    (1)

where $ID_k[j] = \pi_k(DBF_k[j])$. It turns out that this definition satisfies the conditions for a
representative assignment and, moreover, one can choose $\pi_k$ so that the total size of $Repr_k$ is $O(n)$. The
formal proof, which uses probabilistic results of Section 4, is postponed until Section 5.

Let us conclude with a sketch of our approach to constructing $repr_k$. We later show that given
any superset of $Repr_k$ it is easy to construct $repr_k$. We construct a candidate set $C_k$, which is
a small superset of $Repr_k$, in two steps. First, we find $A_k = \{j \in [1, n_k] : ID_k[j] < \ell_k\}$ for an
appropriate parameter $\ell_k$. Then we extend it, setting $C_k = FillGaps(A_k, 2^k, [1, n_k])$, where FillGaps
is defined as follows:

$FillGaps(A, \Delta, I) = A \cup \bigcup \{[i, i + \Delta] : [i, i + \Delta] \subseteq I \setminus A\}$.

Figure 1: C = FillGaps(A, 4, [1, 34]). Here A = {7, 9, 13, 20, 21, 26, 32, 34} and C is obtained from
A by inserting all maximal subintervals of the domain that are disjoint with A and contain more
than 4 integers (in this example, 17 elements are inserted).
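The FillGaps operation can be transcribed almost literally from its definition. The sketch below is a direct, unoptimized reading (the function name `fill_gaps` and its argument layout are hypothetical) and is run on the data of Figure 1.

```python
def fill_gaps(a, delta, lo, hi):
    """FillGaps(A, delta, [lo, hi]): A plus every window [i, i + delta] inside [lo, hi] that avoids A."""
    a = set(a)
    result = set(a)
    for i in range(lo, hi - delta + 1):
        window = range(i, i + delta + 1)
        if all(j not in a for j in window):
            result.update(window)
    return result


A = {7, 9, 13, 20, 21, 26, 32, 34}
C = fill_gaps(A, 4, 1, 34)
print(sorted(C - A))
```

On this input the inserted elements are exactly the gaps [1, 6], [14, 19] and [27, 31], which accounts for the 17 elements mentioned in the caption.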

As we shall see in Section 4 in a more abstract setting, the set $C_k$ generated in this way is always
a superset of $Repr_k$ and its expected size is $O(\frac{nk}{2^k})$. Details of the construction algorithm, already
in the general case, are presented in Section 10.

4 Probabilistic Tools 

Let a be a sequence of length n over [1, m] and let $\pi \in S_m$. We say that the sequence a is $\Delta$-diverse
if for each element $\sigma \in [1, m]$ the set $\{i : a_i = \sigma\}$ is $\Delta$-sparse. Fix a positive integer $\Delta$ and for
$i \in [1, n - \Delta]$ define

$f_\pi(i) = \operatorname{argmin}\{\pi(a_j) : j \in [i, i + \Delta]\}$.

In case of ties, which are possible if a is not $\Delta$-diverse, we take the leftmost index. We say that the
values of $f_\pi$ are local $\pi$-minima.

Consider a function g defined on an interval $[\ell, r]$. We define a piecewise constant representation
of g as a collection of triples $(\ell_i, r_i, u_i)$ such that $g(x) = u_i$ for $x \in [\ell_i, r_i]$ and $\{[\ell_i, r_i] : i\}$ is a
partition of $[\ell, r]$. The size of the representation is the number of triples.

Example 2. An illustration of $f_\pi$ for m = 4, $\Delta = 4$, $\pi = (3, 2, 1, 4)$ and an example sequence a.
Shades of gray represent intervals in a piecewise constant representation of $f_\pi$: (1, 2, 2), (3, 4, 4),
(5, 6, 7), (7, 9, 11). Note that a is 2-diverse.
Lemma 1. Assume that a is a $\frac{\Delta}{2}$-diverse sequence of length n over [1, m] and let $\pi$ be a permutation
of [1, m] drawn uniformly at random. Let $A = \{i : \pi(a_i) < \ell\}$ for a parameter $\ell$, and
$C = FillGaps(A, \Delta, [1, n])$. Then

(a) C contains all local $\pi$-minima,

(b) if $\ell = \frac{2m \log \Delta}{\Delta}$ then $E[|C|] = O(\frac{n \log \Delta}{\Delta})$.

Here we prove (a) only; the proof of (b) is presented in Appendix A.


Proof of (a). Assume $j = f_\pi(i)$ is a local $\pi$-minimum. If $\pi(a_j) < \ell$, then $j \in A$, so in particular
$j \in C$. Otherwise, not only $\pi(a_j) \ge \ell$, but for any $i' \in [i, i + \Delta]$ we have $\pi(a_{i'}) \ge \ell$. Thus
$[i, i + \Delta] \subseteq [1, n] \setminus A$, i.e. the FillGaps operation adds this interval, and in particular j, to C. □
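Part (a) of Lemma 1 can be sanity-checked numerically. The sketch below computes the local minima by brute force and the candidate set C for a given threshold, using 0-based positions and 0-based permutation values, which is a deviation from the 1-based convention of the text; the function names are hypothetical.

```python
import random


def local_minima(a, pi, delta):
    """f_pi(i) for every 0-based i in [0, n - delta): leftmost argmin of pi[a[j]] over [i, i + delta]."""
    return [min(range(i, i + delta + 1), key=lambda j: pi[a[j]])
            for i in range(len(a) - delta)]


def candidate_set(a, pi, delta, threshold):
    """C = FillGaps(A, delta, [0, n - 1]) with A = {i : pi[a[i]] < threshold}."""
    n = len(a)
    low = {i for i in range(n) if pi[a[i]] < threshold}
    result = set(low)
    for i in range(n - delta):
        window = range(i, i + delta + 1)
        if low.isdisjoint(window):
            result.update(window)
    return result


rng = random.Random(0)
m, n, delta = 8, 40, 3
a = [rng.randrange(m) for _ in range(n)]
pi = list(range(m))
rng.shuffle(pi)
minima = local_minima(a, pi, delta)
print(set(minima) <= candidate_set(a, pi, delta, threshold=2))
```

The containment holds for every threshold, which mirrors the argument in the proof: a local minimum is either below the threshold or sits in a window that FillGaps adds as a whole.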

Corollary 3. Let a be a $\frac{\Delta}{2}$-diverse sequence and let $\pi$ be a permutation of [1, m] drawn uniformly at
random. Then the local $\pi$-minima function $f_\pi$ admits a piecewise constant representation of expected
size bounded by $O(\frac{n \log \Delta}{\Delta})$.

Proof. By Lemma 1 the expected number of local $\pi$-minima is $O(\frac{n \log \Delta}{\Delta})$. To obtain the same bound
for the size of a piecewise constant representation, it suffices to prove that $f_\pi$ is non-decreasing. For
a proof by contradiction assume $j' = f_\pi(i') < f_\pi(i) = j$ for $i < i'$. Note that $i < i' \le j' < j \le
i + \Delta < i' + \Delta$. Consequently $j' \in [i, i + \Delta]$ and $j \in [i', i' + \Delta]$. The former implies $\pi(a_j) < \pi(a_{j'})$ by
definition of $f_\pi(i)$ while the latter implies $\pi(a_{j'}) \le \pi(a_j)$ by definition of $f_\pi(i')$, a contradiction. □


5 Square-Free Case 

In Section 3 we have already presented a rough description of our data structure for the square-free 
case. In this section we fill in the missing details of the data structure and the query algorithm in 
this case. We start with a justification of our selection of a representative assignment (1). 

Lemma 2. Let v be a square-free word. There exist permutations $\pi_k \in S_{m_k}$ such that $repr_k$ given by
(1) form a family of representative assignments for v and admit piecewise constant representations
of total size $O(n)$.

Proof. First, let us prove that $repr_k$ given by (1) is a valid representative assignment with respect
to Definition 1. Clearly properties 1 and 2 are satisfied. For a proof of 3 it suffices to note that
$BF_{k+1}(i) = BF_{k+1}(i')$ implies that for any $\delta \in [0, 2^k]$ we have $BF_k(i + \delta) = BF_k(i' + \delta)$; therefore
the minima in (1) for $repr_k(i)$ and $repr_k(i')$ are attained at corresponding k-basic fragments.

We apply the probabilistic method to show that one can choose $\pi_k$ so that $repr_k$ admits a
piecewise constant representation of size $O(\frac{nk}{2^k})$. Note that Observation 1 implies that $DBF_k$ is a
$2^k$-diverse sequence. Moreover, we have defined $repr_k$ as a function assigning local $\pi_k$-minima for
the sequence $DBF_k$ with $\Delta = 2^k$. Thus, by Corollary 3, if $\pi_k$ is drawn uniformly at random, the
expected size of the piecewise constant representation of $repr_k$ is $O(\frac{nk}{2^k})$. In particular, one can
choose $\pi_k$ so that this quantity is actually $O(\frac{nk}{2^k})$. Summing up over all k's we obtain the desired
$O(n)$ bound. □

The solution for the square-free case requires two auxiliary abstract data structures. Their
implementation is described in Appendix B.

Evaluator 

Input: A function $g : [1, n] \to U$ that admits a piecewise constant representation of size m (the
elements of U fit in $O(1)$ words).

Queries: Given i compute g(i). 

Lemma 3. For g specified in a piecewise constant representation of size m, there exists an evaluator
$\mathcal{E}(g)$ of size $O(m + \frac{n}{\log n})$ that answers queries in $O(1)$ time and can be constructed in $O(m + \frac{n}{\log n})$
time.

Locator 

Input: An indexed family $\mathcal{A} = (A_i)$ of d-sparse subsets of [1, n].

Queries: Given an index i and a range P of length $O(d)$, return $A_i \cap P$.

Lemma 4. For a family $\mathcal{A} = (A_i)$ there exists a locator $\mathcal{L}(\mathcal{A})$ of size $O(\sum_i |A_i|)$ that can answer
queries in $O(1)$ time. It can be constructed in $O(\sum_i |A_i|)$ expected time given $\{(i, j) : j \in A_i\}$.

Now we are ready to provide a rigorous description of our data structure for the square-free case. 

Theorem 2. For a square-free word v of length n there exists a data structure of $O(n)$ size that
can answer Factor-in-Factor Occurrence Queries in $O(1)$ time.

Proof. The data structure consists of $\lfloor \log n \rfloor$ layers. For each layer k we store an evaluator of the
representative assignment and a locator of representative positions, that is, $\mathcal{E}(repr_k)$ and $\mathcal{L}(\mathcal{A}^k)$,
where $A^k_{id}$ is the set of representative occurrences of the k-basic factor whose identifier $ID_k$ is id.
In the evaluator we additionally store the identifiers $ID_k$ of the representative positions. We also
maintain the global data structure specified in Fact 1. If $repr_k$ is chosen so that it satisfies Lemma 2,
the total size of piecewise constant representations is $O(n)$, hence the total size of the sets $Repr_k$ is $O(n)$
and so is the total size of the families $\mathcal{A}^k$. We conclude that the total size of evaluators and locators is $O(n)$.

The query algorithm for $x = v[\ell, r]$ and $y = v[\ell', r']$ works as follows. If |x| = 1, we use a
naive algorithm (compare x with each letter of y). Otherwise we use the data structures for the
k-th layer, for $k = \lfloor \log |x| \rfloor - 1$: we use $\mathcal{E}(repr_k)$ to obtain $j = repr_k(\ell)$ and $id = ID_k[j]$. We set
$\delta = j - \ell$ and use $\mathcal{L}(\mathcal{A}^k)$ to compute $A^k_{id} \cap [\ell' + \delta, r' + 1 - |x|]$, i.e. the representative occurrences
of the k-basic factor $BF_k(j)$ which might be induced by an occurrence of x in y (see Figure 2). We
obtain a constant number of them and use Fact 1 to detect which extend to an actual occurrence
of x. □
Figure 2: The query algorithm finds $j = repr_k(\ell)$ and the identifier id of $BF_k(j)$, here depicted
with a gray rectangle. Then, it finds all representative occurrences of id which might be induced by
occurrences of x in y. Here these occurrences lie at positions $occ_1$ and $occ_2$. Potential occurrences
of x are marked with dashed rectangles.
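The whole square-free scheme can be prototyped in a few lines if the evaluator and locator are replaced by plain Python lists and dictionaries. The sketch below is such a simplification under stated assumptions: identifiers are obtained by sorting substrings, candidate representative positions are scanned linearly instead of being retrieved by a constant-time locator, and equality is tested by direct comparison rather than Fact 1. Positions are 0-based and inclusive, all function names are hypothetical, and the test word is a prefix of the ternary word generated by the morphism a → abc, b → ac, c → b, which is assumed here to be square-free.

```python
import random


def thue_square_free(length):
    """Prefix of the fixed point of a -> abc, b -> ac, c -> b."""
    rules = {"a": "abc", "b": "ac", "c": "b"}
    w = "a"
    while len(w) < length:
        w = "".join(rules[ch] for ch in w)
    return w[:length]


def build_layers(v, seed=0):
    """For each k: (ID_k, repr_k, representative positions grouped by identifier)."""
    rng, n, layers, k = random.Random(seed), len(v), {}, 0
    while 2 ** (k + 1) <= n:
        size = 2 ** k
        factors = sorted({v[i:i + size] for i in range(n - size + 1)})
        perm = list(range(len(factors)))
        rng.shuffle(perm)                                   # random permutation pi_k
        rank = {f: perm[r] for r, f in enumerate(factors)}
        ident = [rank[v[i:i + size]] for i in range(n - size + 1)]
        # repr_k(i): position of the minimum identifier in the window [i, i + 2^k]
        repr_k = [min(range(i, i + size + 1), key=ident.__getitem__)
                  for i in range(n - 2 * size + 1)]
        occ = {}
        for j in sorted(set(repr_k)):
            occ.setdefault(ident[j], []).append(j)
        layers[k] = (ident, repr_k, occ)
        k += 1
    return layers


def query(v, layers, l, r, l2, r2):
    """Occurrences of x = v[l..r] inside y = v[l2..r2]."""
    m = r - l + 1
    if m == 1:
        return [p for p in range(l2, r2 + 1) if v[p] == v[l]]
    k = m.bit_length() - 2                                  # floor(log2 |x|) - 1
    ident, repr_k, occ = layers[k]
    j = repr_k[l]
    delta = j - l
    hits = []
    for q in occ[ident[j]]:                                 # representative occurrences of the sample
        p = q - delta                                       # induced candidate position of x
        if l2 <= p <= r2 + 1 - m and v[p:p + m] == v[l:r + 1]:
            hits.append(p)
    return hits


v = thue_square_free(24)
layers = build_layers(v)
print(query(v, layers, 3, 8, 0, 10))
```

Correctness of the sketch rests on property 3 of Definition 1: any occurrence of x at position p starts with the same (k+1)-basic factor as x, so its sample sits at offset delta and is a representative occurrence.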

6 Overview of Periodic Case 

In the periodic case of Factor-in-Factor Occurrence Queries we assume that x contains
a periodic k-basic factor. In the following subsection we introduce the notion of a k-run, a central
notion in the solution of the periodic case. Afterwards we show how to answer queries in this case.

6.1 Repetitive Structure of Words 

We say that a run $\alpha$ extends a fragment u if u is a subfragment of $\alpha$ and $per(u) = per(\alpha)$. For any
periodic fragment u, there is a unique run $\alpha$ extending u; we denote it as run(u) (see Figure 3). If
u is not periodic, we set $run(u) = \bot$. The following lemma, proved in Appendix C, might be of
independent interest.


Figure 3: $run(u) = \alpha$. If u is a k-basic factor then $\alpha$ is a k-run.

Lemma 5. There exists a data structure of $O(n)$ size, which given a fragment u returns run(u) in
constant time. Such a data structure can be constructed in $O(n)$ time.

A run $\alpha$ is called a k-run if $|\alpha| \ge 2^k$ and $per(\alpha) < 2^k$. Alternatively, a run is a k-run if it
extends a periodic fragment of length between $2^k$ and $2^{k+1} - 1$. Note that the definition implies
that if $run(u) = \alpha \ne \bot$, then $\alpha$ is a k-run for $k = \lfloor \log |u| \rfloor$.
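The set of values k for which a given run is a k-run depends only on its length and period. The sketch below enumerates these values directly from the definition, read here as length at least 2^k and period smaller than 2^k; the function name is hypothetical.

```python
def k_values(length, period):
    """All k such that a run with the given length and shortest period is a k-run."""
    ks, k = [], 0
    while 2 ** k <= length:
        if period < 2 ** k:
            ks.append(k)
        k += 1
    return ks


# A run of length 19 with period 3 (exponent 19/3).
print(k_values(19, 3))
```

For a run of length 19 with period 3 this yields three values of k, well below its exponent 19/3.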
Figure 4: A sample run $\alpha$ with $|\alpha| = 19$, $per(\alpha) = 3$ and $exp(\alpha) = 6\frac{1}{3}$. It is a k-run for k = 2, 3, 4.

By $\mathcal{R}(v)$ we denote the set of all runs in v, and by $\mathcal{R}_k(v)$ the set of k-runs in v. Note that
$\bigcup_k \mathcal{R}_k(v) = \mathcal{R}(v)$, but the union is not necessarily disjoint. A fixed run $\alpha$ can be a k-run for at most
$exp(\alpha)$ values of k (see Figure 4), which by Fact 2 implies that $\sum_k |\mathcal{R}_k(v)| = O(n)$. In Appendix C
we prove a stronger property of k-runs, from which we derive the $O(n)$ space bound for several
components of our data structure.

A word that is both primitive and lexicographically minimal in the class of its cyclic rotations is
called a Lyndon word. Let u be a word with period p. The Lyndon root $\lambda$ of u is the Lyndon word
that is a cyclic rotation of u[1, p]. Then u can be represented as $\lambda' \lambda^k \lambda''$ where $\lambda'$ is a proper suffix
of $\lambda$, and $\lambda''$ is a proper prefix of $\lambda$. The Lyndon representation of u is defined as $(|\lambda'|, k, |\lambda''|)$. As
shown in [10], Lyndon representations of all runs can be computed in $O(n)$ time. We say that two
runs are compatible if they have the same Lyndon root.
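For small inputs the Lyndon root and the Lyndon representation can be computed by brute force. The sketch below assumes that p is the shortest period of u and that |u| ≥ 2p, as is the case for runs; it is quadratic and purely illustrative, unlike the linear-time method of [10], and the function names are hypothetical.

```python
def lyndon_root(u, p):
    """Lexicographically least cyclic rotation of u[:p]."""
    block = u[:p]
    return min(block[i:] + block[:i] for i in range(p))


def lyndon_representation(u, p):
    """Return (|lambda'|, k, |lambda''|) such that u = lambda' lambda^k lambda''."""
    root = lyndon_root(u, p)
    # First position where a full copy of the root starts; it exists for |u| >= 2p.
    start = next(s for s in range(p) if u[s:s + p] == root)
    k, tail = divmod(len(u) - start, p)
    return start, k, tail


# Two runs of Example 1: v[5, 10] = abaaba (period 3) and v[1, 11] = baababaabab (period 5).
print(lyndon_root("abaaba", 3), lyndon_representation("abaaba", 3))
print(lyndon_root("baababaabab", 5), lyndon_representation("baababaabab", 5))
```

Under this definition the runs ababa and abab of Example 1 are compatible, since both have the Lyndon root ab.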

6.2 Queries 

Let $x \cap y$ denote the common subfragment of overlapping fragments x, y. The following observation
shows why k-runs are useful in the solution of the periodic case.

Observation 2. Assume that $x = v[\ell, r]$ and $x' = v[\ell', r']$ are occurrences of the same factor and
that x has a periodic k-basic fragment z. Let $\alpha = run(z)$. Then there exists a k-run $\alpha'$ that is
compatible with $\alpha$, such that $x \cap \alpha$ and $x' \cap \alpha'$ are the corresponding subfragments of x and x'.


Figure 5: Synchronization of runs on two occurrences of the same factor (Observation 2). The runs may have different lengths but their intersections with the occurrences of the factor start and end at the same positions relative to these occurrences.


Observation 2 lets us take the following approach for finding $x$ in $y$. First, we locate $z$, a periodic $k$-basic fragment of $x$, and compute the $k$-run $\alpha = \mathrm{run}(z)$ using Lemma 5. Then, we find all $k$-runs $\alpha'$ which are compatible with $\alpha$ and intersect $y$. Now we consider two cases.

If $\alpha$ does not cover $x$, i.e. $x \cap \alpha \ne x$, then the corresponding $k$-run $\alpha'$ also does not cover $x'$. Since $x \cap \alpha$ and $x' \cap \alpha'$ must be corresponding subfragments of $x$ and $x'$ respectively, given $\alpha'$ we have a single possible position of $x'$. Thus, for a fixed $\alpha'$ it suffices to apply Fact 1 for at most one candidate (some might be already out of consideration as exceeding $y$). On the other hand, if $\alpha$ covers $x$, then any occurrence $x'$ of $x$ is guaranteed to be covered by the corresponding $k$-run $\alpha'$, in particular $x' \subseteq y \cap \alpha'$. If $|y \cap \alpha'| < |x|$, then clearly there is no such $x'$. Otherwise, we find the occurrences of $x$ in $y \cap \alpha'$ (already represented as an arithmetic progression) using the Lyndon representations of both fragments, see Section 8 for details.

We get at most one arithmetic progression for each $k$-run $\alpha'$. By Fact 3, these progressions can be combined into a single arithmetic progression.

7 Overview of General Case


The central notion of Section 5 was that of a representative assignment. If we generalized Definition 1 to arbitrary words with periodicities directly, the representative assignment would require $O(n \log n)$ space, e.g. for the word $v = a^n$. However, if the query pattern $x = v[\ell, r]$ contains a periodic $k$-basic factor with $k = \lfloor \log |x| - 1 \rfloor$, we can apply the data structure for the periodic case to answer this query. Thus, if $BF_{k+1}(\ell)$ contains a periodic $k$-basic fragment, we leave $\mathrm{repr}_k(\ell)$ undefined (denoted $\mathrm{repr}_k(\ell) = \bot$) and modify the query algorithm so that it launches the data structure for the periodic case whenever it gets an undefined representative.

Observation 1 does not hold for an arbitrary word $v$. If we assume that $u$ is non-periodic, we obtain a slightly weaker result.

Observation 3 (General Sparsity of Occurrences). Let $u$ and $v$ be words and assume $u$ is not periodic. Then the set of positions where $u$ occurs in $v$ is $\frac{|u|}{2}$-sparse.

Definition 2. Let $v$ be an arbitrary word. We call $\mathrm{repr}_k$ a $k$-representative partial assignment for $v$, if the following conditions are satisfied:

1. $\mathrm{repr}_k : [1, n_{k+1}] \to [1, n_k] \cup \{\bot\}$, i.e. if defined, $\mathrm{repr}_k$ assigns to each $(k+1)$-basic fragment a $k$-basic fragment,

2. $\mathrm{repr}_k(i) \in [i, i + 2^k] \cup \{\bot\}$ for each $i$, i.e. each fragment is assigned one of its subfragments or $\bot$,

3. $\mathrm{repr}_k(i) - i = \mathrm{repr}_k(i') - i'$ or $\mathrm{repr}_k(i) = \mathrm{repr}_k(i') = \bot$ if $BF_{k+1}(i) = BF_{k+1}(i')$, i.e. if two basic factors are equal, either the corresponding subfragments are assigned or both representatives are undefined,

4. $\mathrm{repr}_k(i) = \bot$ if and only if $BF_{k+1}(i)$ contains a periodic $k$-basic factor.

The notions of representative positions $\mathrm{Repr}_k$ and representative occurrences carry on. Generalizing the approach presented in Section 3, we choose permutations $\pi_k \in S_{m_k}$, define $ID_k[j] = \pi_k(DBF_k[j])$ and set


$$\mathrm{repr}_k(i) = \begin{cases} \bot & \text{if } BF_k(j) \text{ is periodic for some } j \in [i, i + 2^k], \\ \operatorname{argmin}\{ID_k[j] : j \in [i, i + 2^k]\} & \text{otherwise.} \end{cases} \tag{2}$$
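
Formula (2) can be evaluated naively as in the following Python sketch. It only illustrates the definition, not the constant-time evaluator of the paper: the arrays `ids` and `periodic` are hypothetical stand-ins for $ID_k$ and for the periodicity test of $BF_k(j)$, positions are 0-based, `None` plays the role of $\bot$, and ties are broken towards the leftmost position.

```python
def representative(ids, periodic, k, i):
    """Naive evaluation of formula (2) at position i (0-based)."""
    window = range(i, i + 2 ** k + 1)        # positions i, ..., i + 2^k
    if any(periodic[j] for j in window):
        return None                          # undefined representative
    return min(window, key=lambda j: ids[j])


ids = [5, 3, 8, 1, 7, 2]
periodic = [False, False, True, False, False, False]
print(representative(ids, periodic, 1, 0))   # window {0, 1, 2} contains a periodic factor
print(representative(ids, periodic, 1, 3))   # window {3, 4, 5}, minimum identifier at 3
```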


Like in the square-free case, we can choose $\pi_k$ so that the assignments $\mathrm{repr}_k$ admit piecewise constant representations of total size $O(n)$, see Section 9 for details.

The main idea of the construction algorithm remains unchanged: we are still looking for a candidate set $C_k$ that is a small superset of $\mathrm{Repr}_k$. As previously we start with $A_k = \{j : ID_k[j] < \ell_k\}$ but we need to use FillGaps much more carefully: instead of $U = [1, n_k]$, we apply it separately for each maximal block of positions where non-periodic $k$-basic fragments start. In particular, we need to use some techniques from the periodic case to find these blocks and to identify positions $i$ where $\mathrm{repr}_k(i)$ is defined. The construction algorithm is presented in detail in Section 10.


8 Full Description of Periodic Case 

In Section 8.1 we introduce a number of combinatorial and algorithmic tools, among which the most important is the so-called $k$-Run Locator. We also give a proof of Fact 3 which states that $O(1)$-sized answers to Factor-in-Factor Occurrence Queries always exist. Then we describe the complete data structure for the periodic case in Section 8.2.

8.1 Toolbox 

We start by recalling two classic lemmas. We use them to prove Fact 3 and to introduce an additional 
combinatorial tool for the periodic case (Fact 4). 

Lemma 6 (Periodicity Lemma [15, 29]). Let $v$ be a word with periods $p$ and $q$. If $p + q \le |v|$, then $\gcd(p, q)$ is also a period of $v$.

Lemma 7 (Synchronization of Primitive Words [7]). Let X be a primitive non-empty word. Then 
X has exactly two occurrences in XX. 

Fact 3. Let $x$, $y$ be words with $|y| \le 2|x|$. Then the set of positions where $x$ occurs in $y$ forms a single arithmetic progression. Moreover, if there are at least 3 occurrences, the difference of this progression is $\mathrm{per}(x)$.

Proof. Assume that $x$ occurs in $y$ at positions $i_1 < i_2 < \cdots < i_m$. If $m \le 2$, the conclusion of the fact is trivially satisfied, so assume that $m \ge 3$. Let $x = q^k q'$, where $|q| = \mathrm{per}(x)$ and $q'$ is a proper prefix of $q$. Note that if $x$ occurs in $y$ at positions $i$, $i'$ with $i < i' \le i + |x|$, then $i' - i$ is a period of $x$. Moreover, $\sum_{j=1}^{m-1} (i_{j+1} - i_j) = i_m - i_1 \le |y| - |x| \le |x|$. Therefore for each $j \in [1, m-1]$ the value $i_{j+1} - i_j$ is a period of $x$ and by the Periodicity Lemma (Lemma 6) it is a multiple of $\mathrm{per}(x)$. It suffices to show that it is actually equal to $\mathrm{per}(x)$. Let us fix $j \in [1, m-1]$ and let $\ell = \frac{i_{j+1} - i_j}{\mathrm{per}(x)}$; note that $\ell$ is an integer in $[1, k]$. Consider the word $z = y[i_j, i_{j+1} + |x| - 1]$. Observe that $z = q^{\ell} x = q^{k+\ell} q'$. Clearly, $x$ occurs in $z$ at position $1 + \mathrm{per}(x)$. Consequently $x$ occurs in $y$ at position $i_j + \mathrm{per}(x)$, which implies that $i_{j+1} - i_j = \mathrm{per}(x)$. □
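
Fact 3 is easy to test experimentally. The following Python sketch uses naive quadratic scans (the helper names are hypothetical) and takes the length condition in the form $|y| \le 2|x|$, which is what the proof uses.

```python
def occurrences(x: str, y: str) -> list[int]:
    """1-based starting positions of x in y (naive scan)."""
    return [i + 1 for i in range(len(y) - len(x) + 1) if y[i:i + len(x)] == x]


def smallest_period(x: str) -> int:
    """Shortest p >= 1 such that x[i] == x[i + p] for all valid i."""
    return next(p for p in range(1, len(x) + 1)
                if all(x[i] == x[i + p] for i in range(len(x) - p)))


def is_progression(seq: list[int]) -> bool:
    """True if consecutive differences of seq are all equal."""
    return all(seq[i + 1] - seq[i] == seq[1] - seq[0] for i in range(len(seq) - 1))


x, y = "ababa", "ababababa"                  # |y| = 9 <= 2|x| = 10
occ = occurrences(x, y)
print(occ, is_progression(occ), smallest_period(x))
```

With three occurrences, the common difference of the reported positions coincides with the shortest period of `x`.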

Fact 4. Let $x$ and $y$ be periodic with a common Lyndon root. Then the set of positions where $x$ occurs in $y$ is an arithmetic progression that can be computed in $O(1)$ time given the Lyndon representations of $x$ and $y$.

Proof. Let $\lambda$ be the common Lyndon root of $x$ and $y$ and let their Lyndon representations be $(p, k, s)$ and $(p', k', s')$ respectively. The Synchronization Property (Lemma 7) implies that $\lambda$ occurs in $y$ only at positions $i$ such that $i \equiv p' + 1 \pmod{|\lambda|}$. Consequently, $x$ occurs in $y$ only at positions $i$ such that $i \equiv p' - p + 1 \pmod{|\lambda|}$. Clearly $x$ occurs in $y$ at all such positions $i$ in $[1, |y| - |x| + 1]$. Therefore it is a matter of simple arithmetic to compute the arithmetic progression of these positions. □
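
The "simple arithmetic" in the proof of Fact 4 can be spelled out as follows. This Python sketch is an illustration under the assumptions of the fact; the function name and the encoding of a progression as `(first, difference, count)` are hypothetical choices.

```python
def occurrences_from_lyndon(root_len, rep_x, rep_y):
    """Positions (1-based) of x in y as (first, difference, count), or None.

    A representation (p, k, s) stands for: suffix of the root of length p,
    then k copies of the root, then a prefix of the root of length s.
    """
    p, k, s = rep_x
    p2, k2, s2 = rep_y
    len_x = p + k * root_len + s
    len_y = p2 + k2 * root_len + s2
    last = len_y - len_x + 1             # largest admissible starting position
    first = (p2 - p) % root_len + 1      # smallest i >= 1 with i = p' - p + 1 (mod |root|)
    if first > last:
        return None
    return first, root_len, (last - first) // root_len + 1


# Root "aab": x = "b" + "aab" = "baab", y = "ab" + "aab" * 2 + "aa" = "abaabaabaa".
print(occurrences_from_lyndon(3, (1, 1, 0), (2, 2, 2)))  # (2, 3, 2)
```

In the example, `"baab"` occurs in `"abaabaabaa"` exactly at positions 2 and 5.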

Apart from the data structure computing run(u), the run extending u for arbitrary periodic factor 
u, in the periodic case we also need a couple of additional data structures. Their implementation is 
provided in Appendix C. 

$k$-Run Locator

Input: A word $v$ of length $n$.

Queries: Given an integer $p$ and a range $P \subseteq [1, n]$ with $|P| = O(2^k)$, compute all $\alpha \in \mathcal{R}_k(v)$ for which $\mathrm{per}(\alpha) = p$ and $\alpha \cap P \ne \emptyset$.

Lemma 8. There exist $k$-run locators $\mathcal{K}_k(v)$ that answer queries in $O(1)$ time, take $O(n)$ space in total, and can be constructed in $O(n)$ expected total time.

Let $R$ be a finite set of integers. We call $B$ a block of $R$ if $B$ is an inclusion-wise maximal interval contained in $R$. A block representation of $R$ is a sorted list of all its blocks.
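
As a small illustration of the definition, the block representation of a set can be computed as in the Python sketch below. The function name is hypothetical, and the sketch sorts its input, so it does not reflect the linear-time construction used in the paper.

```python
def block_representation(R):
    """Sorted list of maximal intervals (lo, hi) contained in the finite integer set R."""
    blocks = []
    for v in sorted(R):
        if blocks and blocks[-1][1] == v - 1:
            blocks[-1] = (blocks[-1][0], v)   # extend the current block
        else:
            blocks.append((v, v))             # start a new block
    return blocks


print(block_representation({1, 2, 3, 7, 8, 10}))  # [(1, 3), (7, 8), (10, 10)]
```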

Lemma 9. For $k \in [0, \lfloor \log n \rfloor]$ let $P_k = \{i \in [1, n_k] : BF_k(i) \text{ is periodic}\}$. In $O(n)$ time we can compute all sets $P_k$, each of them represented both in the block representation and as a bit vector.

We also use the classic successor queries. The following version with efficient construction 
algorithm is described in Appendix B. 

Lemma 10. For an arbitrary set $R \subseteq [1, n]$ there exists a data structure $S(R)$ of size $O(\frac{n}{\log n})$ which answers successor queries on $R$ ($\mathrm{succ}_R(i) = \min(R \cap [i, n])$) in $O(1)$ time. Moreover $S(R)$ can be constructed in $O(\frac{n}{\log n})$ time if $R$ is given as a bit vector.

8.2 Data Structure 

The data structure consists of a global part, comprising the set $\mathcal{R}(v)$ of all runs together with their Lyndon representations, as well as the data structure for computing $\mathrm{run}(u)$ (see Lemma 5) and the data structure of Fact 1 to check occurrences. It also contains $\lfloor \log n \rfloor$ layers; the $k$-th one consists of the data structure for successor queries for $P_k = \{i \in [1, n_k] : BF_k(i) \text{ is periodic}\}$ (see Lemma 10) and the $k$-run locator $\mathcal{K}_k(v)$ (see Lemma 8).

The query algorithm for $x$ and $y$ has already been sketched in Section 6. Here we provide the missing implementation details. Successor queries on $P_k$ let us find the smallest $j \ge i$ such that $BF_k(j)$ is periodic. In particular they allow us to find $z$, the leftmost periodic $k$-basic subfragment of $x$. Once we get $z$, Lemma 5 is used to compute $\alpha = \mathrm{run}(z)$. Then we use $\mathcal{K}_k(v)$ to compute the $k$-runs $\alpha'$ which intersect $y$ and satisfy $\mathrm{per}(\alpha') = \mathrm{per}(\alpha)$. This is done in constant time, which in particular implies that the number of such $k$-runs $\alpha'$ is constant. For each $\alpha'$ we start with verifying whether $\alpha$ and $\alpha'$ are compatible, i.e. whether their Lyndon roots are equal. We use the precomputed Lyndon representations to localize the Lyndon roots as fragments of $v$ and then simply check if they are occurrences of the same factor using Fact 1. For $k$-runs $\alpha'$ which pass this test we proceed as described in Section 6, either checking a single candidate position (if $x \cap \alpha \ne x$) or using Lyndon representations otherwise. In the latter case, if $|y \cap \alpha'| \ge |x|$, we derive the Lyndon representations of $x \cap \alpha$ and $y \cap \alpha'$ from the representations of $\alpha$ and $\alpha'$ and apply Fact 4. Finally we merge several arithmetic progressions into a single one, which is simple since Fact 3 guarantees that their union indeed forms a single arithmetic progression.

The construction algorithm works as follows. The set $\mathcal{R}(v)$ is computed using the algorithm of Kolpakov & Kucherov [24] combined with the results of Crochemore & Ilie [8], so that it does not require a constant-sized alphabet to run in $O(n)$ time. Then we use the algorithm of Crochemore et al. [10] to compute Lyndon representations of runs. We also apply Lemma 9 to compute the sets $P_k$ represented as bit vectors and pass them to the algorithm constructing $S(P_k)$ (Lemma 10). The construction of the remaining components comes down to running the algorithms provided by the appropriate lemmas. Consequently, we obtain the main result of this section.

Theorem 3. There exists a data structure of $O(n)$ size which can be constructed in $O(n)$ expected time that answers Factor-in-Factor Occurrence Queries in $O(1)$ time provided that the pattern $x$ contains a periodic $k$-basic factor for $k = \lfloor \log |x| - 1 \rfloor$.

9 Full Description of General Case


In this section we complete the description of the data structure for the general case of Factor-in-Factor Occurrence Queries. We prove Lemma 11, which is a general version of Lemma 2 that worked only in the square-free case. The construction algorithm of our data structure is shown in the next section.

Lemma 11. Let $v$ be an arbitrary word. There exist permutations $\pi_k \in S_{m_k}$ such that $\mathrm{repr}_k$ given by (2) form a family of representative partial assignments for $v$ and admit piecewise constant representations of total size $O(n)$.

Proof. It is easy to see that $\mathrm{repr}_k$ given by (2) satisfies the conditions required for a representative partial assignment (see Definition 2). Again, we will use the probabilistic method to show that if $\pi_k$ is drawn uniformly at random, then the expected size of a piecewise constant representation of $\mathrm{repr}_k$ is $O(\frac{n}{2^k})$. This will imply that for some choice of $\pi_k$ the actual size is $O(\frac{n}{2^k})$, which sums up to $O(n)$ for all values of $k$.

Here, unlike in the proof of Lemma 2, we cannot simply use Corollary 3 to bound the expected size of representations. First, let us prove that the number of blocks of consecutive $\bot$'s in $\mathrm{repr}_k$ is $O(\frac{n}{2^k})$. Observe that if $BF_k(j)$ is periodic, then $\mathrm{repr}_k(i) = \bot$ for all $i \in [j - 2^k, j] \cap [1, n_k]$. Thus each block of $\bot$'s, possibly except for the first and the last one, contains at least $2^k + 1$ positions. Consequently, the number of such blocks is at most $2 + \frac{n}{2^k + 1}$.

Now, for each maximal block $B = [i, i']$ of positions where $\mathrm{repr}_k$ is defined, we consider an extension $B' = [i, i' + 2^k]$ and the sequence of $DBF_k[j]$ for $j \in B'$. Since $\mathrm{repr}_k$ is defined for all positions in $B$, all $k$-basic factors whose identifiers occur in this sequence are non-periodic. Consequently, this sequence is $2^{k-1}$-diverse by Observation 3. Moreover, $\mathrm{repr}_k$ restricted to $B$ simply assigns local $\pi_k$-minima for that sequence, so we can use Corollary 3 to deduce that the expected size of a piecewise constant representation of $\mathrm{repr}_k$ restricted to $B$ is $O(1 + \frac{|B'|}{2^k}) = O(1 + \frac{|B|}{2^k})$. Now consider all such blocks $B$. Their number is $O(\frac{n}{2^k})$ (proportional to the number of blocks of $\bot$'s) and their total size is obviously $O(n)$. Thus by linearity of expectation we can bound the expected size of a piecewise constant representation of the whole $\mathrm{repr}_k$ by $O(\frac{n}{2^k})$. □

Theorem 4. For any word $v$ of length $n$ there exists a data structure of size $O(n)$ that can answer Factor-in-Factor Occurrence Queries in $O(1)$ time.

Proof. The proof is analogous to that of Theorem 2. Apart from the data structure of Theorem 3 for the periodic case, we use the same components: the data structure from Fact 1, evaluators $\mathcal{E}(\mathrm{repr}_k)$ and locators $\mathcal{L}(\mathcal{A}_k)$ (constructed for partial representative assignments $\mathrm{repr}_k$). The locators $\mathcal{L}(\mathcal{A}_k)$ are now defined for $d = 2^{k-1}$, with $2^{k-1}$-sparsity of $A_k^{id}$ being a consequence of Observation 3.

Relative to the square-free case, there are just two differences in the query algorithm for $x = v[\ell, r]$ and $y = v[\ell', r']$. First, if $\mathrm{repr}_k(\ell)$ (for $k = \lfloor \log |x| - 1 \rfloor$) turns out to be undefined, we find out that we are in the periodic case, so we launch the component responsible for that case. The second modification concerns the very last step, returning the output. In Section 5 we were guaranteed that there is at most one occurrence, so returning it as an arithmetic progression was trivial. Here, we either pass the arithmetic progression obtained from the periodic case, or obtain a constant number of occurrences, which, by Fact 3, also form an arithmetic progression. □

10 Construction of the Data Structure

We start by introducing additional algorithmic tools used solely in the construction algorithm. Their detailed implementation is provided in Appendix D. Afterwards we proceed with a description of the construction algorithm for the general case of Factor-in-Factor Occurrence Queries.

10.1 Space-efficient DBF 

The standard implementation of the Dictionary of Basic Factors uses $\Theta(n \log n)$ space [7, 14]. We introduce its compact version which provides the same operations as the regular DBF but with linear space and construction time. Its main component is a data structure of Gawrychowski [17] which efficiently locates basic factors in the suffix tree.

CompactDBF 

Input: a word v of length n 

Queries: for an integer k: 

(1) given a position i return DBF k [i], 

(2) given an identifier j return {i : DBF k [i] = j }, 

(3) return m k , the number of distinct identifiers in DBF k . 

Lemma 12. For a word $v$ of length $n$ there exists a CompactDBF $\mathcal{D}(v)$ which takes $O(n)$ space, can be constructed in $O(n)$ time and can answer (1) and (3) queries in $O(1)$ time, and (2) queries with $O(1)$ time delay per item reported.

Recall that in order to define $\mathrm{repr}_k$ we have used identifiers $ID_k[j] = \pi_k(DBF_k[j])$, where $\pi_k$ is a permutation in $S_{m_k}$. RandomizedDBF is a modification of CompactDBF which, instead of $DBF_k[i]$, operates on $ID_k[i] = \pi_k(DBF_k[i])$ as identifiers for queries (1) and (2), where for each level $k$, $\pi_k$ is drawn uniformly at random from $S_{m_k}$.

Lemma 13. For a word $v$ of length $n$ there exists a RandomizedDBF $\mathcal{D}^*(v)$ which takes $O(n)$ space, can be constructed in $O(n)$ expected time and can answer (1) and (3) queries in $O(1)$ time, and (2) queries with $O(1)$ time delay per item reported.

RandomizedDBF is the source of randomization in our construction algorithm. Note that for different values of $k$, the permutations $\pi_k$ are not independent. Actually, we could not draw them independently, as this would require $\Omega(n \log^2 n)$ bits of randomness as opposed to the $O(n \log n)$ bits we can get during the $O(n)$ time construction.

10.2 Computing Candidates 

As indicated in Sections 3 and 7, the crucial step of the construction algorithm is building candidate sets $C_k$, supersets of $\mathrm{Repr}_k$ whose total size is $O(n)$.

Lemma 14. Let $v$ be an arbitrary word of length $n$. There exists an algorithm which returns sets $C_k \subseteq [1, n_k]$ together with identifiers $ID_k[j]$ for $j \in C_k$ such that:

• there exist permutations $\pi_k \in S_{m_k}$ such that $ID_k[j] = \pi_k(DBF_k[j])$ and, for the representative partial assignment $\mathrm{repr}_k$ defined using (2) with these identifiers, $C_k \supseteq \mathrm{Repr}_k$,

• $E[\sum_k |C_k|] = O(n)$.

The expected running time of the algorithm is $O(n + \sum_k |C_k|) = O(n)$.

Proof. The algorithm is based on an instance of RandomizedDBF $\mathcal{D}^*(v)$, which in particular makes a random choice of the underlying permutations $\pi_k$. We start with presenting our construction of sets $C_k$, then prove its correctness using the results of Section 4 and conclude with providing an efficient implementation of our algorithm.

Recall that in Section 8 we have defined $P_k$ as the set of positions $j$ such that $BF_k(j)$ is periodic. By $N_k$ we denote its complement $[1, n_k] \setminus P_k$. The candidate sets $C_k$ are constructed in two steps. First we set $A_k = \{j \in N_k : ID_k[j] < \ell_k\}$, with the threshold $\ell_k$ (a function of $m_k$) chosen as previously. Then for each block $B$ of $N_k$ we compute $\mathrm{FillGaps}(A_k \cap B, 2^k, B)$ and return the union of these sets as $C_k$.

Consider a block $B$ of the set of positions where $\mathrm{repr}_k$ is defined. Note that if we extend $B$ by $2^k$ positions to the right, we obtain a block $B'$ of $N_k$. Moreover, $\mathrm{repr}_k$ for positions in $B$ assigns local $\pi_k$-minima for $DBF_k$ restricted to $B'$. By Observation 3, $DBF_k$ restricted to $B'$ is $2^{k-1}$-sparse. Consequently we can apply Lemma 1, which gives $E[|B' \cap C_k|] = O(\frac{|B'|}{2^k})$. By linearity of expectation we conclude that $E[|C_k|] = O(\frac{n}{2^k})$ and $E[\sum_k |C_k|] = O(n)$. This proves the correctness of our algorithm; it remains to provide an efficient implementation.

Lemma 9 lets us efficiently compute sets $P_k$, both in a block representation and as bit vectors. The former can be easily transformed to a block representation of $N_k$, and the latter can be used to test in $O(1)$ time for $j \in [1, n_k]$ whether $BF_k(j)$ is periodic. We construct $A_k$ separately for every $k$. We use a type (3) query on $\mathcal{D}^*(v)$ to determine $m_k$, which is necessary to compute $\ell_k$. Then for each identifier $id < \ell_k$ we use a type (2) query on $\mathcal{D}^*(v)$ to get one occurrence of the corresponding $k$-basic factor. We use the bit-vector representation of $P_k$ to test if this $k$-basic factor is periodic. For non-periodic factors we proceed with the execution of the type (2) query, adding to $A_k$ all the positions where the factor occurs. This way $A_k$ is constructed in $O(1 + |A_k| + \ell_k)$ time. Then we simultaneously sort all the sets $A_k$, which increases the cost by a single $O(n)$ term.

Once the sets $A_k$ are sorted we apply the FillGaps operations, again independently for each $k$. We simultaneously traverse $A_k$ and the blocks of $N_k$. This lets us determine all blocks of $N_k \setminus A_k$, and add to $C_k$ all elements of those blocks of size at least $2^k + 1$. This is equivalent to running $\mathrm{FillGaps}(A_k \cap B, 2^k, B)$ separately for every block $B$ of $N_k$. Apart from $O(|C_k|)$ time to traverse $A_k$ and actually fill the gaps, this procedure requires additional time proportional to the number of blocks of $N_k$. However, since these representations for all sets $N_k$ were constructed in $O(n)$ total time, this extra cost sums up to $O(n)$. Finally, we equip each $j \in C_k$ with $ID_k[j]$ using a type (1) query on $\mathcal{D}^*(v)$. □
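
The gap-filling step described in the proof can be sketched in Python as follows. FillGaps itself is defined earlier in the paper; this sketch only follows the description given in the proof (every maximal interval of the block that avoids $A_k$ and has at least $2^k + 1$ positions is added in full). The function name, the argument order and the threshold `d + 1` are assumptions made for illustration.

```python
def fill_gaps(a_sorted, d, block):
    """Return a_sorted together with all positions of block = (lo, hi) that lie
    in a gap (maximal interval of the block without elements of a_sorted)
    of size at least d + 1. Assumes a_sorted is sorted and contained in the block."""
    lo, hi = block
    result = []
    prev = lo - 1                            # position just before the current gap
    for a in list(a_sorted) + [hi + 1]:      # sentinel closes the final gap
        if a - prev - 1 >= d + 1:            # the gap (prev, a) is long enough
            result.extend(range(prev + 1, a))
        if a <= hi:
            result.append(a)
        prev = a
    return result


print(fill_gaps([3, 10], 2, (1, 12)))  # [3, 4, 5, 6, 7, 8, 9, 10]
```

A window of $2^k + 1$ consecutive positions that contains no element of $A_k$ lies entirely inside such a long gap, which is why all of its positions must be kept as candidates.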

Unfortunately, the bound on $\sum_k |C_k|$ provided by Lemma 14 holds only in expectation, and we are to construct a data structure with a guaranteed $O(n)$ size bound. Nevertheless it is easy to modify this algorithm so that $\sum_k |C_k|$ is guaranteed to be $O(n)$.

Lemma 15. The algorithm of Lemma 14 can be modified so that $\sum_k |C_k|$ is guaranteed to be $O(n)$, with the running time still $O(n)$ in expectation.

Proof. We run the algorithm of Lemma 14 and repeat until the actual value of $\sum_k |C_k|$ does not exceed twice its expectation. If the random bits used by subsequent iterations are independent, by Markov's inequality each iteration succeeds with probability at least $\frac{1}{2}$. Consequently the probability that the $i$-th iteration is performed is at most $\frac{1}{2^{i-1}}$. The expected running time of a single iteration is $O(n)$, so the total expected running time is $O(n)$. □
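
The final step of this proof is a geometric sum. Writing $T$ for the total running time and assuming, as in the proof, that iterations use independent random bits and each costs $O(n)$ in expectation, it can be spelled out as:

```latex
\[
  \mathbb{E}[T]
  \;\le\; \sum_{i \ge 1} \Pr[\text{iteration } i \text{ is performed}] \cdot O(n)
  \;\le\; \sum_{i \ge 1} \frac{1}{2^{\,i-1}} \cdot O(n)
  \;=\; 2 \cdot O(n)
  \;=\; O(n).
\]
```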

10.3 Construction of the Representatives

The most involved part of the construction algorithm is building a small piecewise constant representation of a representative partial assignment $\mathrm{repr}_k$. The choice of an appropriate assignment is already made by the algorithm of Lemma 15, which gives sets $C_k$ equipped with identifiers $ID_k[j]$ for $j \in C_k$. The subsequent step involves the following abstract function, whose implementation is described in Appendix D.

Function Slider

Input: Positive integers $d < m$ and a set $A$ of pairs $(q, p)$ with $q \in \mathbb{Z}$ and $p \in [1, m]$.

Output: A piecewise constant representation of $G : [1, m - d] \to A \cup \{\bot\}$ defined as follows: $G(i)$ is the lexicographically smallest pair $(q, p) \in A$ among pairs with $p \in [i, i + d]$, or $\bot$ if no such pair exists.


Lemma 16. Slider can be implemented in $O(|A|)$ time, provided that pairs in $A$ are sorted by $p$ in the input.
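
One standard way to compute such a sliding-window minimum is a monotone deque, sketched below in Python. This is not the implementation of Appendix D: the sketch iterates over all positions and therefore runs in $O(m + |A|)$ time rather than the $O(|A|)$ claimed by Lemma 16. The output format, a list of triples `(first_i, last_i, value)` with `None` for $\bot$, is a hypothetical choice.

```python
from collections import deque


def slider(d, m, pairs):
    """Piecewise constant representation of G on [1, m - d].

    pairs: list of (q, p) sorted by p, with 1 <= p <= m.
    """
    window = deque()   # increasing in (q, p) and in p; the front is the window minimum
    nxt = 0            # index of the next pair to enter the window
    pieces = []
    for i in range(1, m - d + 1):
        # Admit pairs with p <= i + d, discarding larger pairs they dominate.
        while nxt < len(pairs) and pairs[nxt][1] <= i + d:
            while window and window[-1] > pairs[nxt]:
                window.pop()
            window.append(pairs[nxt])
            nxt += 1
        # Drop pairs that fell out of the window on the left.
        while window and window[0][1] < i:
            window.popleft()
        value = window[0] if window else None
        if pieces and pieces[-1][2] == value:
            pieces[-1] = (pieces[-1][0], i, value)   # extend the current piece
        else:
            pieces.append((i, i, value))
    return pieces


print(slider(2, 10, [(5, 2), (3, 4), (7, 5), (1, 9)]))
```

Each pair enters and leaves the deque at most once, so the deque operations cost $O(|A|)$ in total; the remaining cost of this sketch is the loop over positions.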

We run Slider for $\{(ID_k[j], j) : j \in C_k\}$ and $d = 2^k$, which gives a piecewise constant representation of a function

$$G(i) = \begin{cases} \bot & \text{if } C_k \cap [i, i + 2^k] = \emptyset, \\ \min\{(ID_k[j], j) : j \in C_k \cap [i, i + 2^k]\} & \text{otherwise.} \end{cases}$$


Since $C_k$ is guaranteed to contain all representative positions, whenever $\mathrm{repr}_k(i) = j \ne \bot$ we have $G(i) = (ID_k[j], j)$. Thus, in order to construct a piecewise constant representation of $\mathrm{repr}_k$, it suffices to find all the maximal intervals where $\mathrm{repr}_k$ is defined, use the representation of $G$ in those intervals and set $\bot$ elsewhere.

Recall that we have defined $N_k = \{j : BF_k(j) \text{ is non-periodic}\}$ and block representations of all sets $N_k$ can be obtained in $O(n)$ time (see Lemma 9). Also, note that $\mathrm{repr}_k(i) \ne \bot$ if and only if $[i, i + 2^k] \subseteq N_k$. Therefore it suffices to take all intervals in the representation of $N_k$, remove those of length at most $2^k$ and trim the remaining ones by $2^k$ positions from the right. Consequently, a piecewise constant representation of $\mathrm{repr}_k$ can be constructed in time proportional to $|C_k|$ and the size of the block representation of $N_k$. Both terms sum up to $O(n)$ for all values of $k$.

Once we have $\mathrm{repr}_k$, we can run the construction algorithm of the evaluator $\mathcal{E}(\mathrm{repr}_k)$. We also prepare the set $\{(ID_k[j], j)\}$, which is passed to the construction algorithm of a locator $\mathcal{L}(\mathcal{A}_k)$, where $\mathcal{A}_k$ is an indexed family of sets $A_k^{id}$, with $A_k^{id}$ defined as the set of all representative occurrences of the $k$-basic factor whose identifier $ID_k$ is $id$.

Finally, we construct the global components, i.e. the data structure of Theorem 3 for the periodic case and the component of Fact 1. This way we obtain the result mentioned already in Section 1.2 that concludes the whole description of the data structure for Factor-in-Factor Occurrence Queries.

Theorem 1. Factor-in-Factor Occurrence Queries can be answered in $O(1)$ time by a data structure of size $O(n)$, which can be constructed in $O(n)$ expected time.


11 Applications 

In this section we show how Factor-in-Factor Occurrence Queries can be used in answering other types of internal queries considered in this paper.

11.1 Prefix-Suffix Queries & Period Queries


Prefix-Suffix Queries 

Given factors x and y of v and a positive integer d, report all prefixes of x of length between d 
and 2d that are also suffixes of y (represented as an arithmetic progression of their lengths). 

Definition 3. If a word z is simultaneously a prefix of a word x and a suffix of a word y, we call 
it a prefix-suffix of a pair (x,y). 

First, observe that we can assume that $|x| = |y|$. Indeed, if $|x| \ne |y|$ we can shorten the longer of the factors (removing the suffix in case of $x$ and the prefix in case of $y$). Let $d'$ be the common length of $x$ and $y$. If $d' > 2d$, we can further shorten both $x$ and $y$ in the same manner as before, so that $d' = 2d$. On the other hand, if $d' < d$, clearly an empty set needs to be reported. Thus we may assume that $d' \in [d, 2d]$, which lets us apply the following lemma, see also Figure 6:

Lemma 17 ([22]). Let $d$ be a positive integer and let $x$, $y$ be words such that $d \le |x| = |y| \le 2d$. Let $x'$ be the prefix of $x$ of length $d$ and let $y'$ be the suffix of $y$ of length $d$. The following conditions are equivalent for an integer $\ell \ge d$:

(a) $(x, y)$ has a prefix-suffix of length $\ell$,

(b) $y'$ occurs in $x$ at position $\ell - d + 1$ and $x'$ occurs in $y$ at position $|y| - \ell + 1$.


Figure 6: A pair $(x, y)$ has a prefix-suffix $z$ of length $\ell$ if and only if $y'$ and $x'$ occur at certain positions in $x$ and $y$, respectively.

Denote the set of all positions where $x'$ occurs in $y$ by $Occ(x', y)$ and the set of all occurrences of $y'$ in $x$ by $Occ(y', x)$. Lemma 17 implies that we need to compute the following set of lengths:

$$\{o + d - 1 : o \in Occ(y', x)\} \cap \{|y| - o + 1 : o \in Occ(x', y)\}.$$

Since common elements of two arithmetic progressions form an arithmetic progression, we can already see that our result indeed forms an arithmetic progression.

We can use Factor-in-Factor Occurrence Queries to find both $Occ(x', y)$ and $Occ(y', x)$, both in constant time (see Figure 7). Then, computing the result is a matter of shifting, reversing and intersecting arithmetic progressions. The former two operations can easily be implemented in constant time, but the latter requires more attention. If the length of one of the sequences that we intersect is constant, we can simply verify which elements belong to the other. Also, if the sequences have the same difference, intersecting them in $O(1)$ time is trivial. It has been proved in [22] that if $|Occ(x', y)| \ge 3$ and $|Occ(y', x)| \ge 3$, then both of these sets form arithmetic progressions with the same difference. Thus, the two aforementioned special cases of intersection suffice for our needs.
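
The reduction from Prefix-Suffix Queries to occurrence sets can be illustrated by the following Python sketch. It replaces Factor-in-Factor Occurrence Queries with naive scans and explicit sets, so it shows only the combinatorics of Lemma 17, not the constant-time algorithm; the function names are hypothetical and the input is assumed to satisfy $d \le |x| = |y| \le 2d$.

```python
def occ(pattern: str, text: str) -> set[int]:
    """1-based starting positions of pattern in text (naive scan)."""
    m = len(pattern)
    return {i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern}


def prefix_suffix_lengths(x: str, y: str, d: int) -> list[int]:
    """Lengths l >= d such that the prefix of x of length l is a suffix of y."""
    assert d >= 1 and d <= len(x) == len(y) <= 2 * d
    x1, y1 = x[:d], y[-d:]                          # x' and y' of Lemma 17
    from_x = {o + d - 1 for o in occ(y1, x)}        # candidates given by Occ(y', x)
    from_y = {len(y) - o + 1 for o in occ(x1, y)}   # candidates given by Occ(x', y)
    return sorted(from_x & from_y)


print(prefix_suffix_lengths("abaab", "baaba", 3))  # [3]
```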

Figure 7: Prefix-Suffix Queries

Consequently, the data structure of Theorem 1 can answer Prefix-Suffix Queries in $O(1)$ time, which gives the following result.

Theorem 5. Using a data structure of $O(n)$ size, which can be constructed in $O(n)$ expected time, one can answer Prefix-Suffix Queries in $O(1)$ time.


Period Queries 

Given a factor x of v , report all periods of x (represented by disjoint arithmetic progressions). 

Corollary 4. Using a data structure of $O(n)$ size, which can be constructed in $O(n)$ expected time, one can answer Period Queries in $O(\log |x|)$ time.

Proof. Period Queries can be answered using the data structure for Prefix-Suffix Queries. To compute all periods of $x$ we use Prefix-Suffix Queries to find all borders of $x$ (words which are simultaneously prefixes and suffixes of $x$) of length between $2^k - 1$ and $2(2^k - 1)$ for each $k \in [0, \lfloor \log(|x| + 1) \rfloor]$. Lengths of borders can be easily transformed to periods, since any word $x$ has period $p$ if and only if it has a border of length $|x| - p$. □
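
The decomposition into length ranges used in this proof can be sketched in Python as follows. The helper `border_lengths` is a naive, hypothetical stand-in for one Prefix-Suffix Query on the pair $(x, x)$; only proper borders are considered, so the empty border yields the trivial period $|x|$.

```python
def border_lengths(x: str, lo: int, hi: int) -> list[int]:
    """Proper border lengths b of x with lo <= b <= hi (naive check)."""
    n = len(x)
    return [b for b in range(lo, min(hi, n - 1) + 1) if x[:b] == x[n - b:]]


def periods(x: str) -> list[int]:
    """All periods of x, collected range by range as in the proof of Corollary 4."""
    n = len(x)
    result = []
    k = 0
    while 2 ** k - 1 < n:
        d = 2 ** k - 1                  # the ranges [d, 2d] partition 0, 1, 2, ...
        result.extend(n - b for b in border_lengths(x, d, 2 * d))
        k += 1
    return sorted(result)


print(periods("abaab"))  # [3, 5]
```

The loop runs for $O(\log |x|)$ values of $k$, which matches the query time stated in the corollary when each range is answered in constant time.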


2-Period Queries 

Given a factor x of v, decide whether x is periodic and, if so, compute its shortest period. 

While 2-Period Queries can be trivially reduced to Prefix-Suffix Queries (asking for borders of $x$ of length at least $\frac{|x|}{2}$), our techniques give a much simpler solution, with an additional merit in the form of a deterministic construction algorithm.

Corollary 5. Using a data structure of $O(n)$ size, which can be constructed in $O(n)$ time, one can answer 2-Period Queries in $O(1)$ time.

Proof. Recall that for any periodic factor $x$ we have defined $\mathrm{run}(x)$ as the run extending $x$, and as $\bot$ if $x$ is not periodic. Lemma 5 gives a data structure computing $\mathrm{run}(x)$ in $O(1)$ time that can be constructed in $O(n)$ deterministic time. Moreover, if $x$ is periodic then $\mathrm{per}(x) = \mathrm{per}(\mathrm{run}(x))$. □

11.2 Cyclic Equivalence Queries 

We define $Rot(u) = u[n]u[1] \ldots u[n - 1]$. Additionally, for an integer $r$ we write $Rot(u, r)$ to denote $Rot$ applied $r$ times on $u$. Note that $Rot(u, r) = Rot(u, r')$ if $r \equiv r' \pmod{n}$. We say that $w$ is a cyclic rotation of $u$ if there exists an integer $r$ such that $w = Rot(u, r)$.

Cyclic Equivalence Queries 

Given factors x and y of v , decide whether x is a cyclic rotation of y and, if so, report all 
corresponding cyclic shift values (represented as an arithmetic progression). 

Before we proceed with a solution, let us state a stronger version of Fact 1. For words $x$, $y$ denote the length of the longest common prefix of $x$ and $y$ by $\mathrm{lcp}(x, y)$.

Fact 5. Let $v$ be a word of length $n$. After $O(n)$ preprocessing the following queries can be answered in $O(1)$ time for any words $x$, $y$ represented as concatenations of $O(1)$ factors of $v$:

(a) compute $\mathrm{lcp}(x, y)$,

(b) decide if $x = y$,

(c) for an integer $p \le |x|$ find the longest prefix of $x$ which has period $p$.

Proof. We use the same data structure as for Fact 1, i.e. the suffix array with its inverse, and the LCP array equipped with the data structure for range minimum queries. A classic application of this toolbox is computing $\mathrm{lcp}$ for a pair of suffixes [7, 14]. A straightforward generalization of this algorithm lets us work with concatenations of a constant number of factors, which gives (a). Equality queries (b) are an immediate consequence of $\mathrm{lcp}$ queries. For (c) it suffices to observe that the length of the answer is equal to $p + \mathrm{lcp}(x, x[p + 1, |x|])$. □
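
The identity used for part (c) can be checked with the following Python sketch. It computes $\mathrm{lcp}$ naively instead of using the suffix array and range minimum queries; the function names are hypothetical.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b (naive scan)."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def longest_prefix_with_period(x: str, p: int) -> int:
    """Length of the longest prefix of x that has period p, for 1 <= p <= |x|."""
    return p + lcp(x, x[p:])       # x[p:] is x[p + 1, |x|] in 1-based notation


print(longest_prefix_with_period("abababc", 2))  # 6, the prefix "ababab"
```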

Now, we are ready to present an algorithm for Cyclic Equivalence Queries. We can clearly assume that $|x| = |y|$. Let us denote the common length of $x$ and $y$ by $d$, and the desired set of cyclic shifts $\{r \in [0, d - 1] : y = Rot(x, r)\}$ by $R(x, y)$. The following observation not only is useful for computing $R(x, y)$, but combined with Fact 3 also proves that this set indeed forms an arithmetic progression.

Observation 4. Let x,y be words of common length. Then R(x,y) is equal to the set of positions 
among {0,..., |y| — 1} where x occurs in yy. 
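
Observation 4 translates directly into a naive Python sketch. It assumes non-empty words and 0-based positions in $yy$, and the function names are hypothetical; the sketch takes time quadratic in $|x|$, unlike the constant-time query developed below.

```python
def rot(u: str, r: int = 1) -> str:
    """Apply Rot r times; one application moves the last letter to the front."""
    r %= len(u)
    return u[len(u) - r:] + u[:len(u) - r]


def cyclic_shifts(x: str, y: str) -> list[int]:
    """R(x, y): 0-based positions below |y| at which x occurs in yy."""
    if len(x) != len(y):
        return []
    yy = y + y
    return [r for r in range(len(y)) if yy[r:r + len(x)] == x]


print(cyclic_shifts("abab", "baba"))  # [1, 3]
```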

Below, we give an algorithm which computes $R(x, y) \cap [0, \frac{d}{2}]$, i.e. cyclic shifts not exceeding $\frac{d}{2}$. Note that $y = Rot(x, r)$ if and only if $x = Rot(y, d - r)$, so running this algorithm both for $(x, y)$ and $(y, x)$ lets us easily retrieve $R(x, y)$.

Let $x'$ be the prefix of $x$ of length $\lceil \frac{d}{2} \rceil$. Note that any occurrence of $x$ in $yy$ at position at most $\frac{d}{2}$ induces an occurrence of $x'$ in $y$. We use Factor-in-Factor Occurrence Queries to find all positions where $x'$ occurs in $y$; each of them is a candidate shift value. If the number of occurrences is constant (at most 2), we can verify each candidate, using Fact 5(b) to test whether $x$ actually occurs in $yy$ at the appropriate position.


[Figure 8: Cyclic Equivalence Queries. The diagram marks the occurrences Occ(x', y).]

Otherwise, Fact 3 guarantees that the occurrences lie at positions $j, j + p, \ldots, j + kp$ where
p = per(x'). We need to find out at which of these positions x actually occurs in yy. We apply
Fact 5(c) to find two values: $\ell$, the length of the longest prefix of x which admits period p, and
m, the length of the longest prefix of y[j, |y|]y which admits period p (see Figure 8). Observe that
for any $i \in [0, k]$ the longest prefix of y[j + pi, |y|]y which admits period p has length m − pi.
Consequently, we have two cases:

• If $\ell = |x|$ (p is a period of x), then x occurs in yy at all positions j + pi such that $|x| \le m - pi$.

• Otherwise, among the candidates considered, x may occur in yy only at position j + pi with
$\ell = m - pi$. We check this candidate using Fact 5(b).
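The case analysis above can be sketched as follows. This is an illustration under stated assumptions rather than the constant-time procedure: positions are 0-based, the periodic prefix lengths are found by naive scanning instead of Fact 5(c), the final check is a direct string comparison instead of Fact 5(b), and both function names are hypothetical.

```python
def longest_periodic_prefix(w: str, p: int) -> int:
    """Length of the longest prefix of w that has period p (naive scan)."""
    n = p
    while n < len(w) and w[n] == w[n - p]:
        n += 1
    return min(n, len(w))


def shifts_from_progression(x: str, y: str, j: int, p: int, k: int) -> list[int]:
    """Filter the candidate positions j, j+p, ..., j+k*p (occurrences of x' in y).

    Returns those candidates at which the whole word x occurs in yy.
    """
    yy = y + y
    ell = longest_periodic_prefix(x, p)
    m = longest_periodic_prefix(yy[j:], p)  # plays the role of y[j,|y|]y
    if ell == len(x):
        # p is a period of x: every candidate with enough periodic text works.
        return [j + p * i for i in range(k + 1) if len(x) <= m - p * i]
    # Otherwise only the candidate with ell == m - p*i can match; verify it.
    i, rem = divmod(m - ell, p)
    if rem == 0 and 0 <= i <= k and yy[j + p * i: j + p * i + len(x)] == x:
        return [j + p * i]
    return []
```

For example, with x = aaaaaabb and y = bbaaaaaa the prefix x' = aaaa occurs in y at positions 2, 3, 4 (so j = 2, p = 1, k = 2), and only position 2 survives the filter.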

Thus, the data structure for Factor-in-Factor Occurrence Queries accompanied with
the one of Fact 5 can answer Cyclic Equivalence Queries. This implies the following result.

Theorem 6. Using a data structure of $O(n)$ size, which can be constructed in $O(n)$ expected time,
one can answer Cyclic Equivalence Queries in $O(1)$ time.

11.3 Generalized Substring Compression 

In this section we improve the results of [21] for Generalized Substring Compression Queries.

Generalized Substring Compression Queries

Given two factors x and y of v, compute $\mathrm{LZ}(x \mid y)$, that is the part of the LZ77 [34] compression
$\mathrm{LZ}(y\$x)$ corresponding to x, where $\$ \notin \Sigma$.

We actually provide a more efficient algorithm for the following auxiliary problem, introduced in [21].

Bounded Longest Common Prefix Queries

Given two factors x and y of v, find the longest prefix p of x which is a factor of y.

Before we proceed with a solution, we recall a number of tools related to suffix trees and mention 
several results developed in [21] for the original solution for BOUNDED LONGEST COMMON PREFIX 
Queries. 

11.3.1 Tools 

The suffix trie of v is the trie of all suffixes of v. Each factor x of v corresponds to a unique node in 
the suffix trie, called the locus of x, such that x is spelled by the letters on the path from the root 
to that node. 

The suffix tree of v [7, 14, 18], denoted T(v), is the compacted suffix trie of v, i.e. nodes that 
are not branching (with 2 or more children) nor terminal (loci of suffixes of v) are dissolved. The 
dissolved nodes are called implicit, the remaining nodes are called explicit. An implicit node x can 
be represented as a pair (u,d), where u is the lowest explicit descendant of x, and d is the distance 
(the number of letters) from x to u. The pair (u, d) is called the locus of the factor corresponding 
to x, and u is called its explicit locus. 

The suffix tree of v takes linear space and can be constructed in linear time provided that the 
letters of v can be sorted in linear time [7]. The following result is due to Gawrychowski [17]. 

Lemma 18. The suffix tree T(v) can be preprocessed in $O(|v|)$ time so that given integers i, k the
locus of the basic factor $BF_k(i)$ can be determined in $O(1)$ time.

We also use as a black-box several results developed in [21] for the original solution for BOUNDED 
Longest Common Prefix Queries. The result of Lemma 20 we have already referred to as a 
decision version of internal pattern matching queries. 

Interval Longest Common Prefix Queries 

Given an interval $[\ell, r]$ and a factor x of v, find the longest prefix p of x which occurs at some
position $t \in [\ell, r]$ in v.

Lemma 19 ([21]). Using a data structure of $O(n + S_{rsucc})$ size, one can answer Interval Longest
Common Prefix Queries in $O(Q_{rsucc})$ time, provided that x is given by its locus in T(v).

Lemma 20 ([21]). For a word v of length n there exists a data structure of size $O(n + S_{rempt})$, such
that given factors x, y one can decide whether x occurs in y in $O(Q_{rempt})$ time, provided that x is
given by its locus in T(v).

11.3.2 Query Algorithm 

Assume $x = v[\ell', r']$ and $y = v[\ell, r]$. First, we search for the largest k such that the prefix of x
of length $2^k$ (i.e. $BF_k(\ell')$) occurs in y. We use a variant of binary search involving exponential search
(also called galloping search), which requires $O(\log K)$ steps where K is the optimal value of k. At
each step for a fixed k we need to decide if $BF_k(\ell')$ occurs in y. This can be done in $O(Q_{rempt})$
time: we find the locus of $BF_k(\ell')$ using Lemma 18 and then apply Lemma 20.
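The search for K can be sketched as a generic galloping search over a monotone predicate. In the sketch below, `pred(k)` stands in for the test "the prefix of x of length $2^k$ occurs in y" (answered via Lemmas 18 and 20 in the text); the function name, the `limit` parameter and the assumption that `pred(0)` holds are illustrative choices, not part of the original description.

```python
def largest_true(pred, limit: int) -> int:
    """Largest k in [0, limit] with pred(k) true.

    Assumes pred is monotone (true up to some K, false afterwards) and that
    pred(0) holds. Uses O(log K) evaluations of pred.
    """
    hi = 1
    while hi <= limit and pred(hi):   # galloping phase: 1, 2, 4, 8, ...
        hi *= 2
    lo = hi // 2                      # pred(lo) is known to hold (or lo == 0)
    hi = min(hi, limit + 1)           # pred(hi) fails or hi is out of range
    while hi - lo > 1:                # binary search inside (lo, hi)
        mid = (lo + hi) // 2
        if pred(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

A naive stand-in for the predicate is `lambda k: x[:2**k] in y`, with `limit` set to the largest k such that $2^k \le |x|$.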

At this point we have an integer K such that the optimal prefix p has length $2^K \le |p| < 2^{K+1}$.
So far the complexity was $O(Q_{rempt} \log K) = O(Q_{rempt} \log\log |p|)$ time.

Let p' be the prefix obtained from an Interval Longest Common Prefix Query for x and
$[\ell, r - 2^{K+1}]$. Note that $BF_{K+1}(\ell')$ does not occur in y, so $|p'| < 2^{K+1}$ and therefore the occurrence
of p' starting within $[\ell, r - 2^{K+1}]$ lies within y. Thus $|p'| \le |p|$; moreover, if p occurs at a position
within $[\ell, r - 2^{K+1}]$, then clearly p = p'.

The other possibility is that p occurs in y only near its end, i.e. within the suffix of y of length
$2^{K+1}$, which we denote as y'. Let x' be the prefix of x of length $2^K$. Note that x' is a prefix
of p, so any occurrence of p in y' induces an occurrence of x' in y'. We use Factor-in-Factor
Occurrence Queries to locate all positions where x' occurs in y'; these are the only possible
positions where p might occur in y'. If the number of these positions is constant (at most 2), we can
verify each in constant time: it suffices to ask an lcp query (see Fact 5(a)) for x and the appropriate
suffix of y'.

Otherwise we know that x' is periodic and we know its period q, which by Fact 3 is equal to
the difference in the arithmetic progression of occurrences. Let $y_r$ be the suffix of y' starting with
the r-th leftmost occurrence of x' in y'. We compute two values using Fact 5(c): d, the length of
the longest prefix of x which admits period q, and $d_1$, the length of the longest prefix of $y_1$ which
admits period q (see Figure 9). Note that for $y_r$ the corresponding value is $d_r = d_1 - (r - 1)q$.


[Figure 9: Bounded Longest Common Prefix Queries. The diagram marks the occurrences Occ(x', y').]


Observation 5. Let u, u' be words such that $\mathrm{lcp}(u, u') \ge q$. Let d (d') be the length of the longest
prefix of u (resp. u') which has period q. If $d \ne d'$, then $\mathrm{lcp}(u, u') = \min(d, d')$. Otherwise,
$\mathrm{lcp}(u, u') \ge d$.

Observation 5 lets us restrict our attention to $y_1$ (which maximizes $\min(d, d_r)$) and $y_r$ such that
$d = d_r$ (if any). Thus, even if there are more occurrences of x' in y', we need to consider only 2 of
them.
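Observation 5 is easy to check experimentally. The sketch below tests the statement for a given pair of words with naive scans; the helper names are hypothetical, and a word shorter than q is treated as having period q.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b (naive scan)."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def periodic_prefix_len(w: str, q: int) -> int:
    """Length of the longest prefix of w that has period q (naive scan)."""
    n = q
    while n < len(w) and w[n] == w[n - q]:
        n += 1
    return min(n, len(w))


def observation5_holds(u: str, u2: str, q: int) -> bool:
    """Check the claim of Observation 5 for the words u, u2 and the period q."""
    common = lcp(u, u2)
    if common < q:
        return True  # the hypothesis lcp(u, u') >= q is not satisfied
    d, d2 = periodic_prefix_len(u, q), periodic_prefix_len(u2, q)
    if d != d2:
        return common == min(d, d2)
    return common >= d
```

For u = ababac, u' = abababab and q = 2 we get d = 5, d' = 8 and lcp(u, u') = 5 = min(d, d').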

Consequently, we can always choose the final solution as the best among three candidates: one
obtained from the Interval Longest Common Prefix Query, and two corresponding to the
occurrences of x' in y', with the actual lengths obtained using lcp queries.

Thus, the data structure for Factor-in-Factor Occurrence Queries, accompanied with
the suffix tree T(v) and the data structure of Lemma 18, as well as the data structures of
Lemmas 19, 20 and Fact 5, can answer Bounded Longest Common Prefix Queries in
$O(Q_{rsucc} + Q_{rempt} \log\log |p|)$ time.

Lemma 21. Using a data structure of $O(n + S_{rempt} + S_{rsucc})$ size, one can answer Bounded
Longest Common Prefix Queries in $O(Q_{rsucc} + Q_{rempt} \log\log |p|)$ time.

Theorem 7. Using a data structure of $O(n + S_{rempt} + S_{rsucc})$ size, one can answer Generalized
Substring Compression Queries in $O\bigl(C(Q_{rsucc} + Q_{rempt} \log\log \frac{|x|}{C})\bigr)$ time, where C is the
number of phrases reported.

Proof. The algorithm for Generalized Substring Compression Queries is identical to the
one presented in [21]; it just uses our solution for Bounded Longest Common Prefix Queries
instead of the original one. Thus, if the output phrases are $p_1, \ldots, p_C$, it runs in
$O\bigl(\sum_{i=1}^{C} (Q_{rsucc} + Q_{rempt} \log\log |p_i|)\bigr)$ time, which using Jensen's inequality for the concave function
log log gives the desired time bound. □
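The Jensen step can be spelled out as follows. This is a sketch under the assumption (not restated in this section) that the reported phrases $p_1, \ldots, p_C$ together cover x, so that their lengths sum to $|x|$, and that all arguments lie in the range where log log is defined and concave.

```latex
\[
  \sum_{i=1}^{C} \log\log |p_i|
  \;\le\; C \cdot \log\log\Bigl(\frac{1}{C}\sum_{i=1}^{C} |p_i|\Bigr)
  \;=\; C \cdot \log\log\frac{|x|}{C},
\]
\[
  \text{hence}\quad
  \sum_{i=1}^{C} \bigl(Q_{rsucc} + Q_{rempt}\log\log |p_i|\bigr)
  \;\le\; C\Bigl(Q_{rsucc} + Q_{rempt}\log\log\frac{|x|}{C}\Bigr).
\]
```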
A Proof of Lemma 1(b)

Lemma 1. Assume that a is a $\frac{\Delta}{2}$-diverse sequence of length n over [1, m] and let $\pi$ be a permutation
of [1, m] drawn uniformly at random. Let $A = \{i : \pi(a_i) \le \ell\}$ for a parameter $\ell$, and $C =
\mathrm{FillGaps}(A, \Delta, [1, n])$. Then

(a) C contains all local $\pi$-minima,

(b) if $\ell = \lfloor \frac{2m \log \Delta}{\Delta} \rfloor$, then $E[|C|] = O(\frac{n \log \Delta}{\Delta})$.

Before we give the proof, let us recall a standard fact and apply it in an auxiliary claim. 

Fact 6. Let U be a set, T be its subset of size t, and let S be drawn uniformly at random from the
family of subsets of U of size s. Then $\mathbb{P}[S \cap T = \emptyset] \le \bigl(1 - \frac{t}{|U|}\bigr)^s \le \exp\bigl(-\frac{st}{|U|}\bigr)$.
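A quick numeric sanity check of bounds of this form: the exact probability that a uniformly random s-subset of U misses T is $\binom{|U|-t}{s} / \binom{|U|}{s}$, which can be compared with $(1 - t/|U|)^s$ and $\exp(-st/|U|)$. The function names below are hypothetical.

```python
from math import comb, exp


def miss_probability(u: int, t: int, s: int) -> float:
    """Exact probability that a random s-subset of a u-set avoids a fixed t-subset."""
    return comb(u - t, s) / comb(u, s)


def fact6_bounds(u: int, t: int, s: int) -> tuple[float, float]:
    """The two upper bounds (1 - t/u)^s and exp(-s*t/u)."""
    return (1 - t / u) ** s, exp(-s * t / u)
```

For |U| = 20, t = 5 and s = 4 the exact probability is 1365/4845, below both bounds.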

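Claim 1. Let $P \subseteq [1, n]$ be an interval of size $\lceil \frac{\Delta}{2} \rceil$. Then $\mathbb{P}[P \cap A = \emptyset] \le \frac{e}{\Delta}$.

Proof. Let $V_P = \{a_i : i \in P\}$. Observe that $\frac{\Delta}{2}$-diversity implies that $|V_P| = |P| \le m$. Note that
$P \cap A = \emptyset$ if and only if $\pi(V_P) \cap [1, \ell] = \emptyset$, or equivalently $V_P \cap \pi^{-1}([1, \ell]) = \emptyset$. Observe that
$L = \pi^{-1}([1, \ell])$ is a subset of [1, m] of size $\ell$ drawn from the uniform distribution. Thus by Fact 6

$$\mathbb{P}[P \cap A = \emptyset] = \mathbb{P}[V_P \cap L = \emptyset] \le \exp\Bigl(-\frac{\ell\,|V_P|}{m}\Bigr) \le \exp\Bigl(-\frac{\Delta}{2m}\Bigl(\frac{2m \log \Delta}{\Delta} - 1\Bigr)\Bigr)$$

$$= \exp\Bigl(-\frac{2m \log \Delta - \Delta}{2m}\Bigr) \le \exp(-\log \Delta + 1) \le \frac{e}{\Delta}. \quad □$$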

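Proof of Lemma 1(b). First, let us bound the expected size of A.

$$E[|A|] = \sum_{i=1}^{n} \mathbb{P}[\pi(a_i) \le \ell] = \sum_{i=1}^{n} \frac{\ell}{m} = \frac{n\ell}{m} \le \frac{2n \log \Delta}{\Delta}.$$

Let us consider a position $j \in C \setminus A$. By definition of FillGaps there must be an integer interval
R such that $j \in R \subseteq [1, n] \setminus A$ and $|R| \ge \Delta$. Let us define $R_{\le j} = [j - \lceil \frac{\Delta}{2} \rceil + 1, j]$ and
$R_{\ge j} = [j, j + \lceil \frac{\Delta}{2} \rceil - 1]$. Note that $R_{\le j} \subseteq R$ or $R_{\ge j} \subseteq R$. Moreover $R_{\le j} \subseteq R$ implies $R_{\le j} \subseteq [1, n]$ and
$R_{\le j} \cap A = \emptyset$; by Claim 1 this holds with probability at most $\frac{e}{\Delta}$. A similar reasoning holds for $R_{\ge j}$.
Therefore:

$$E[|C \setminus A|] = \sum_{j=1}^{n} \mathbb{P}[j \in C \setminus A] \le \sum_{j=1}^{n} \frac{2e}{\Delta} = O\Bigl(\frac{n}{\Delta}\Bigr).$$

Consequently $E[|C|] = E[|A|] + E[|C \setminus A|] = O\bigl(\frac{n \log \Delta}{\Delta} + \frac{n}{\Delta}\bigr) = O\bigl(\frac{n \log \Delta}{\Delta}\bigr)$. □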

B Abstract Data Structures 

In this section we provide an efficient implementation of the data structure for successor queries, as
well as two auxiliary data structures used as building blocks of our main construction: Evaluator
and Locator. Thus we prove Lemmas 10, 3 and 4. We start with a description of auxiliary data
structures used in the first two lemmas. To prove the third lemma, we use perfect hashing [16].

B.1 Rank & Select


For a bit vector B we define $\mathrm{rank}_B(i)$ as the number of positions $j \le i$ such that B[j] = 1, and
$\mathrm{select}_B(i)$ as the position of the i-th 1-bit in B, i.e. the minimum j such that $\mathrm{rank}_B(j) = i$, or $\bot$ if no
such position exists.

While there are many data structures computing rank and select (the classic ones can be found
in [19, 5, 30]), most of them focus on space and query time bounds and do not provide an efficient
construction algorithm. The word-RAM model allows the input bit vector B of length n to be given
in $O(\frac{n}{\log n})$ machine words, thus a desired construction time is $O(\frac{n}{\log n})$. We are not aware of any
paper or book providing a construction algorithm running in that time, and thus for the sake of
completeness we describe such an algorithm below.

Lemma 22. For a bit vector B of length n one can construct in $O(\frac{n}{\log n})$ time a data structure of
size $O(\frac{n}{\log n})$ that can answer $\mathrm{rank}_B$ and $\mathrm{select}_B$ queries in constant time.

Proof. First, note that answers to all possible rank and select queries for bit vectors of size $\ell =
\lfloor \frac{1}{2} \log n \rfloor$ can be memoized in $O(n^{1/2+\varepsilon}) = o(\frac{n}{\log n})$ time and space.

Let us divide B into blocks $B_1, \ldots, B_m$ of length $\ell$ (if $B_m$ is shorter, we pad $B_m$ with zeros).
We can arrange B so that each block corresponds to a single machine word.

For $\mathrm{rank}_B$ queries, we simply memoize the answers for positions divisible by $\ell$. For this, we use
the precomputed results of rank queries to count ones in each block and then compute prefix sums
of such a sequence. This clearly takes $O(\frac{n}{\log n})$ time and space. To answer a query for a position j,
we add up the result for the largest position divisible by $\ell$ not exceeding j and the memoized result
of a corresponding query in the block containing the j-th position.

The data structure for select queries is more involved. For each $i \in [1, m]$ we store the index j
of the i-th non-zero block (if it exists). We define a bit vector $D_B$, such that $D_B[i] = 1$ if the i-th
1-bit in B is the first 1-bit in its block. Such a bit vector can be computed in $O(\frac{n}{\log n} + \frac{n}{\ell})$ time: it
suffices to start with a null vector and flip all bits of index $\mathrm{rank}_B(k\ell) + 1$ for $k \in [0, \lfloor \frac{n}{\ell} \rfloor]$.

Observe that $\mathrm{rank}_{D_B}(i) = j$ implies that the i-th 1-bit of B lies in the j-th non-zero block of B, whose
index among all blocks has been precomputed. Once we know this index k and $\mathrm{rank}_B((k - 1)\ell)$, a
$\mathrm{select}_B$ query reduces to a select query within the block, for which we have a memoized result.

In total the components of the data structure for select queries clearly take $O(\frac{n}{\log n})$ space and
time to construct. □
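The block decomposition can be illustrated by a simplified sketch. It keeps the per-block prefix sums used for rank, but it departs from the proof in two labeled ways: in-block queries are answered by scanning a short list rather than by table lookup on a machine word, and select locates the block by binary search over the prefix sums instead of the auxiliary vector $D_B$. The class name and the default block length are hypothetical; positions are 1-based as in the text.

```python
from bisect import bisect_left


class RankSelect:
    """Simplified block-based rank/select over a list of 0/1 values."""

    def __init__(self, bits, block_len: int = 4):
        self.bits = list(bits)
        self.block_len = block_len
        # prefix[k] = number of 1-bits among the first k blocks
        self.prefix = [0]
        for start in range(0, len(self.bits), block_len):
            ones = sum(self.bits[start:start + block_len])
            self.prefix.append(self.prefix[-1] + ones)

    def rank(self, i: int) -> int:
        """Number of 1-bits at positions 1..i (for 0 <= i <= n)."""
        k, r = divmod(i, self.block_len)
        start = k * self.block_len
        return self.prefix[k] + sum(self.bits[start:start + r])

    def select(self, i: int):
        """Position of the i-th 1-bit, or None if it does not exist."""
        if i < 1 or i > self.prefix[-1]:
            return None
        k = bisect_left(self.prefix, i)      # smallest k with prefix[k] >= i
        start = (k - 1) * self.block_len     # the answer lies in the k-th block
        need = i - self.prefix[k - 1]
        for offset, bit in enumerate(self.bits[start:start + self.block_len]):
            need -= bit
            if need == 0:
                return start + offset + 1
```

For the bit vector 011000101, rank(3) = 2 and select(3) = 7.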

B.2 Successor Queries 

Lemma 10. For an arbitrary set $R \subseteq [1, n]$ there exists a data structure S(R) of size $O(\frac{n}{\log n})$,
which answers successor queries on R ($\mathrm{succ}_R(i) = \min(R \cap [i, n])$) in $O(1)$ time. Moreover S(R)
can be constructed in $O(\frac{n}{\log n})$ time if R is given as a bit vector.

Proof. Let B be the bit vector representing R. Observe that $\mathrm{succ}_R(i) = \mathrm{select}_B(\mathrm{rank}_B(i - 1) + 1)$
(with $\mathrm{rank}_B(0)$ defined as 0, and $\mathrm{select}_B(j)$ defined as $\infty$ for j greater than the number of
1-bits in B). Thus, it suffices to use the data structure for $\mathrm{rank}_B$ and $\mathrm{select}_B$ queries provided in
Lemma 22. □
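The reduction of successor queries to rank and select is short enough to sketch directly. The version below uses naive linear-time rank and select, so it only illustrates the identity from the proof; the function names are hypothetical and positions are 1-based.

```python
def rank(bits, i: int) -> int:
    """Number of 1-bits at positions 1..i."""
    return sum(bits[:i])


def select(bits, i: int):
    """Position of the i-th 1-bit, or infinity if there are fewer than i ones."""
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        seen += bit
        if bit and seen == i:
            return pos
    return float("inf")


def successor(bits, i: int):
    """Smallest element of R that is >= i, where bits is the bit vector of R."""
    return select(bits, rank(bits, i - 1) + 1)
```

With R = {2, 3, 7, 9} represented as 011000101, the successor of 4 is 7.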
B.3 Evaluator


Evaluator 

Input: A function $g : [1, n] \to U$ that admits a piecewise constant representation of size m.
The elements of U fit in $O(1)$ words.

Queries: Given i compute g(i).

Lemma 3. For g specified in a piecewise constant representation of size m, there exists an evaluator
$\mathcal{E}(g)$ of size $O(m + \frac{n}{\log n})$ that answers queries in $O(1)$ time and can be constructed in $O(m + \frac{n}{\log n})$
time.

Proof. Let $R = \{(\ell_i, r_i, v_i)\}$ be the piecewise constant representation of g. We store the values $v_i$
in an array indexed with i. Let us define a bit vector $B_R$ with $B_R[j] = 1$ if and only if $j = \ell_i$ for
some i. $B_R$ will be represented in $O(\frac{n}{\log n})$ words. Clearly $B_R$ can be constructed in $O(\frac{n}{\log n} + m)$
time starting from a null vector and setting $B_R[\ell_i] = 1$ for all i. Observe that $g(j) = v_i$ where
$i = \mathrm{rank}_{B_R}(j)$.

We build the data structure for rank queries in $B_R$, applying Lemma 22. In total we obtain the
desired $O(\frac{n}{\log n} + m)$ bounds on the space and construction time, and constant-time queries. □
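A minimal sketch of the evaluator follows. It assumes that the pieces $(\ell_i, r_i, v_i)$ are given sorted by $\ell_i$ and cover [1, n] without gaps, and it replaces the succinct rank structure of Lemma 22 by a plain array of prefix sums, so it matches the query logic but not the space bound. The class name is hypothetical.

```python
from itertools import accumulate


class Evaluator:
    """Evaluate a piecewise constant function g on [1, n] via rank on piece starts."""

    def __init__(self, n: int, pieces):
        # pieces: list of (l, r, v), sorted by l, covering [1, n] (assumption)
        self.values = [v for _, _, v in pieces]
        marks = [0] * (n + 1)           # marks[j] = 1 iff some piece starts at j
        for l, _, _ in pieces:
            marks[l] = 1
        # rank[j] = number of piece starts at positions <= j
        self.rank = list(accumulate(marks))

    def __call__(self, j: int):
        return self.values[self.rank[j] - 1]
```

For the pieces (1, 2, a), (3, 5, b), (6, 6, c) on [1, 6], the call g(4) returns b.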

B.4 Locator 

Locator 

Input: An indexed family $\mathcal{A} = (A_i)$ of d-sparse subsets of [1, n].

Queries: Given an index i and a range P of length $O(d)$ return $A_i \cap P$.

Lemma 4. For a family $\mathcal{A} = (A_i)$ there exists a locator $\mathcal{L}(\mathcal{A})$ of size $O(\sum_i |A_i|)$ that can answer
queries in $O(1)$ time. It can be constructed in $O(\sum_i |A_i|)$ time given $\{(i, j) : j \in A_i\}$.

Proof. We divide the universe [1, n] into blocks $B_\ell = [\ell d, (\ell + 1)d - 1]$ for $\ell \in \mathbb{Z}$. The data structure
is based on perfect hashing, see [16]. For each $j \in A_i$ we store an item with key $(\lfloor \frac{j}{d} \rfloor, i)$ and
value (j, i). For a query we extend P to P' such that P' is composed of full blocks. Note that
$|P'| \le |P| + 2d = O(d)$. Let $P' = [\ell d, \ell' d - 1]$. For each $m \in [\ell, \ell' - 1]$ (note that there are $O(1)$ such
values) we retrieve all items with the key (m, i). Clearly this gives $\{(j, i) : j \in P' \cap A_i\}$. The size
of this set is constant by the sparsity condition. Now it suffices to filter out pairs with $j \in P' \setminus P$
and return j for the remaining ones. □
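The locator can be sketched with an ordinary hash table. In the sketch below a Python dictionary stands in for perfect hashing (so the constant-time guarantees are expected rather than worst-case), the family is passed as a dictionary from indices to sets of positions, and the class and method names are hypothetical.

```python
from collections import defaultdict


class Locator:
    """Report A_i ∩ [lo, hi] by looking up the blocks of size d that meet the range."""

    def __init__(self, d: int, family):
        # family: dict mapping an index i to an iterable of positions (the set A_i)
        self.d = d
        self.table = defaultdict(list)
        for i, positions in family.items():
            for j in positions:
                self.table[(j // d, i)].append(j)   # key: (block number, index)

    def query(self, i, lo: int, hi: int):
        out = []
        # the blocks meeting [lo, hi] form the extended range P'
        for block in range(lo // self.d, hi // self.d + 1):
            candidates = self.table.get((block, i), ())
            out.extend(j for j in candidates if lo <= j <= hi)  # drop P' \ P
        return sorted(out)
```

When the range has length $O(d)$, only $O(1)$ keys are probed, and by sparsity each probe returns $O(1)$ items.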

C Algorithmic Tools for Periodic Case 

In this section we provide proofs of three lemmas of algorithmic nature that we used in the solution 
of the periodic case of Factor-in-Factor Occurrence Queries. Before that we present several 
more facts related to periodicities. The first one is a classic result. 

Lemma 23 (Three Squares Lemma [13, 9]). Let $v_1, v_2, v_3$ be words such that $v_1^2$ is a prefix of $v_2^2$,
$v_2^2$ is a prefix of $v_3^2$ and $v_1$ is primitive. Then $|v_1| + |v_2| \le |v_3|$.

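Fact 7. (a) $\sum_k \sum_{\alpha \in \mathcal{R}_k(v)} \frac{|\alpha|}{2^k} = O(n)$, (b) $\sum_k |\mathcal{R}_k(v)| = O(n)$.

Proof. Let us fix a run $\alpha$. We have

$$\sum_{k \,:\, \alpha \in \mathcal{R}_k(v)} \frac{|\alpha|}{2^k} \le \sum_{k \,:\, \mathrm{per}(\alpha) < 2^k} \frac{|\alpha|}{2^k} = \sum_{k = \lfloor \log \mathrm{per}(\alpha) \rfloor + 1}^{\infty} \frac{|\alpha|}{2^k} = \frac{|\alpha|}{2^{\lfloor \log \mathrm{per}(\alpha) \rfloor}} \le \frac{2|\alpha|}{\mathrm{per}(\alpha)} = 2\exp(\alpha).$$

Summing up over all runs and applying the bound for the sum of exponents of runs (Fact 2), we
get (a). As for part (b), if $\alpha$ is a k-run, then $|\alpha| \ge 2^k$, so $\frac{|\alpha|}{2^k} \ge 1$. Thus (b) is a consequence of (a):

$$\sum_k |\mathcal{R}_k(v)| \le \sum_k \sum_{\alpha \in \mathcal{R}_k(v)} \frac{|\alpha|}{2^k} = O(n). \quad □$$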

Observation 6. Let $\alpha \ne \alpha'$ be runs with period p. Then $|\alpha \cap \alpha'| < p$.

Definition 4. We say that a word u is k-periodic, if u is periodic, $|u| \ge 2^k$ and $\mathrm{per}(u) < 2^k$.

Fact 8. Let u be a k-periodic fragment of v. Then run(u) is the unique run $\alpha$ such that $\mathrm{per}(\alpha) \le
|u| - \mathrm{per}(u)$ and $u \cap \alpha = u$. Moreover run(u) is a k-run and $\mathrm{per}(\mathrm{run}(u)) = \mathrm{per}(u) \le \frac{|u|}{2}$.

Proof. The latter statement is an immediate consequence of the definitions. Assume that $\beta$ is a
different run satisfying the aforementioned conditions. Note that both $\mathrm{per}(\alpha)$ and $\mathrm{per}(\beta)$ are periods
of u and $\mathrm{per}(\alpha) + \mathrm{per}(\beta) \le |u|$. Periodicity Lemma (Lemma 6) implies that $\mathrm{per}(\alpha) = \mathrm{per}(\beta)$.
However, u is a subfragment of $\alpha \cap \beta$, which leads to a contradiction by Observation 6. □

Lemma 24. Let $u_1, u_2, u_3$ be k-periodic fragments of v, all starting at the same position i. Then
$\mathrm{run}(u_1)$, $\mathrm{run}(u_2)$, $\mathrm{run}(u_3)$ cannot be all distinct.

Proof. For a proof by contradiction assume that runs $\alpha_j = \mathrm{run}(u_j)$ are pairwise distinct. Observation 6
implies that these runs must have pairwise distinct periods $p_j$. Without loss of generality assume
$p_1 < p_2 < p_3$, i.e. $u_1, u_2, u_3$ are three periodic factors of length at least $2^k$ with different shortest
periods $p_1 < p_2 < p_3 < 2^k$ starting at the position i in v. By the Three Squares Lemma (Lemma 23)
we conclude that $p_1 + p_2 \le p_3 < 2^k$. Now consider the k-basic fragment u starting at position i.
We have $2p_1 < 2^k$, therefore u is a k-periodic fragment. Observe that both $\alpha_1$ and $\alpha_2$ satisfy the
statement of Fact 8, which implies that $\alpha_1 = \mathrm{run}(u) = \alpha_2$, a contradiction. □

Fact 9. Any position lies within at most 2 runs of period p. 

Proof. Consider any three distinct runs $\alpha_1 = v[i_1, j_1]$, $\alpha_2 = v[i_2, j_2]$ and $\alpha_3 = v[i_3, j_3]$ with period
p. Assume that $i_1 < i_2 < i_3$. From Observation 6 we get

$$i_3 > j_2 - p + 1 = j_2 - i_2 + 1 - p + i_2 = |\alpha_2| - p + i_2 \ge p + i_2 > p + j_1 - p + 1 > j_1.$$

Thus $i_3 > j_1$, which means that $\alpha_1$ and $\alpha_3$ do not intersect. □

C.1 Proofs of Algorithmic Lemmas for Periodic Case

Lemma 5. There exists a data structure of $O(n)$ size, which given a fragment u returns run(u) in
constant time. Moreover, the data structure can be constructed in $O(n)$ time.

Proof. Consider a function $R_k : [1, n_k] \to 2^{\mathcal{R}_k(v)}$ which assigns to a position i the set of k-runs
inducing a k-periodic fragment starting at position i (see Definition 4). Lemma 24 implies that
$|R_k(i)| \le 2$ for each i. Note that for $\alpha = v[i, j]$ we have $\alpha \in R_k(i')$ if and only if
$i' \in [i, \min(j - 2\mathrm{per}(\alpha), j - 2^k) + 1]$.

For a k-run $\alpha = v[i, j]$ define $\mathrm{beg}_k(\alpha) = i$ and $\mathrm{end}_k(\alpha) = \min(j - 2\mathrm{per}(\alpha), j - 2^k)$. Observe
that $R_k(i) \ne R_k(i - 1)$ implies $i = \mathrm{beg}_k(\alpha)$ or $i = \mathrm{end}_k(\alpha)$ for some k-run $\alpha$. Thus $R_k$ admits a
piecewise constant representation of size at most $2|\mathcal{R}_k(v)|$.

Summing up over all k and applying Fact 7 this implies that the total size of the representations of
$R_k$ is $O(n)$, which means that Evaluators $\mathcal{E}(R_k)$ take $O(n)$ space in total. The piecewise constant
representation of $R_k$ can easily be constructed by an algorithm which traverses the word from left
to right maintaining a set A of (at most 2) k-runs. At each position i it removes from A all k-runs
$\alpha$ with $\mathrm{end}_k(\alpha) = i$ and adds those with $\mathrm{beg}_k(\alpha) = i$. Such events can be prepared and sorted in
$O(n)$ time simultaneously for all k.

We answer queries as follows. If $u = v[i, j]$ is periodic, then it is k-periodic for $k = \lfloor \log |u| \rfloor$, so
the run inducing $v[i, j]$ belongs to $R_k(i)$. We use $\mathcal{E}(R_k)$ to find all such k-runs in constant time.
Finally, we check if $\alpha = \mathrm{run}(u)$ using the characterization of Fact 8, i.e. verifying that $\alpha$ covers u
and $\mathrm{per}(\alpha) \le \frac{|u|}{2}$. □


k-Run Locator

Input: A word v of length n.

Queries: Given an integer p and a range $P \subseteq [1, n]$ with $|P| = O(2^k)$, compute all $\alpha \in \mathcal{R}_k(v)$
for which $\mathrm{per}(\alpha) = p$ and $\alpha \cap P \ne \emptyset$.

Lemma 8. There exist k-run locators $\mathcal{K}_k(v)$ that answer queries in $O(1)$ time, take $O(n)$ space in
total, and can be constructed in $O(n)$ expected total time.

Proof. The data structure is similar to Locator, see Lemma 4. Let us divide [1, n] into blocks
$B_1, \ldots, B_m$ of size $2^k$ (the last one possibly shorter). Note that $m = O(\frac{n}{2^k})$. We build a hash table:
for each k-run $\alpha$ and each index i such that $B_i$ overlaps $\alpha$ we store an item with key $(i, \mathrm{per}(\alpha))$ and
value $\alpha$. Note that the number of items is bounded by $\sum_{\alpha \in \mathcal{R}_k(v)} \bigl(\lceil \frac{|\alpha|}{2^k} \rceil + 1\bigr)$, which when summed
over all k is $O(n)$ by Fact 7. Moreover, by Fact 9 there are at most 4 values for a fixed key. Indeed,
any k-run intersecting a fixed block $B_i$ must contain the first or the last position in that block, and
each of these positions might be contained in at most two k-runs of period p.

For a query range P we extend P to P' so that $P' = B_i \cup \ldots \cup B_j$ (with $j - i = O(1)$) and find
all k-runs of period p overlapping P'. As we have just shown, there are $O(1)$ such k-runs, so we can
easily check which of them overlap with P; duplicates can also be removed in constant time. □

Lemma 9. For $k \in [0, \lfloor \log n \rfloor]$ let $P_k = \{i \in [1, n_k] : BF_k(i) \text{ is periodic}\}$. In $O(n)$ time we can
compute all sets $P_k$, each of them represented both in the block representation and as a bit vector.

Proof. Each periodic k-basic fragment is induced by a unique k-run. Moreover, for a fixed k-run $\alpha$
the set of positions where k-basic fragments induced by $\alpha$ start forms an interval. If $\alpha = v[i, j]$,
the interval is $[i, j - 2^k + 1]$ if $\mathrm{per}(\alpha) \le 2^{k-1}$ or $\emptyset$ otherwise.

In order to compute the block representations of $P_k$, it suffices to sort these intervals and join
some of them so that we get blocks of $P_k$. This takes $O(n + \sum_k |\mathcal{R}_k(v)|) = O(n)$ time in total by
Fact 7. The bit vector of $P_k$ can be obtained from a block representation in time proportional to the
total size of both representations. We start with a null vector, and then for each block B of $P_k$ we set
all bits corresponding to that block. Using bit-operations we can do this in $O(1 + \frac{|B|}{\log n})$ time. For a
fixed k, the former terms sum up to the number of blocks (the size of the block representation) and
the latter to $O(\frac{n}{\log n})$ (the size of the bit vector representation). In total we obtain $O(n)$ construction
time over all k's. □

D Algorithmic Tools for Construction Algorithm 

In this section we present two algorithmic tools which we use for the construction algorithm,
thereby proving Lemmas 12, 13 and 16.

D.1 Compact DBF and Randomized DBF

Recall the definitions of the suffix tree and loci as well as Lemma 18, all given in Section 11.3.1. We
say that an explicit node u is a k-basic node if u is an explicit locus of a k-basic factor. Observe
that there is a natural bijection between identifiers in $DBF_k$ and the k-basic nodes. While storing
it explicitly for all k takes too much space, we shall devise an alternative way to evaluate it.

There are up to 2n explicit nodes; let us assign them pre-order identifiers id. Note that such
identifiers also preserve lexicographic order of the corresponding factors. We have the following
observation.

Observation 7. If the explicit locus of $BF_k(i)$ is u, then $DBF_k[i]$ is the number of k-basic nodes
with identifiers not exceeding id(u).

Due to the observation, it suffices to store a bit vector $B_k$ such that $B_k[i] = 1$ if and only if the
explicit node u with id(u) = i is k-basic. Then rank queries on $B_k$ (see Appendix B) can be used to
determine the identifier of u in $DBF_k$. Similarly, select queries on $B_k$ let us find the id of the node
which corresponds to a given identifier in $DBF_k$. In order to be able to report the occurrences of
the k-basic factor with $O(1)$-time delay, for each explicit node we maintain pointers to the leftmost
and rightmost terminal node in the corresponding subtree. Additionally, we maintain a linked list
of terminal nodes in lexicographic order of their labels.

We use Lemma 22 to efficiently construct the data structures for rank and select queries on $B_k$,
but first we need to determine these bit vectors. Note that a single node can be k-basic for many
values of k, but these values form a range, since the set of lengths of factors, for which u is an
explicit locus, forms a range. We can construct such ranges for each explicit node. For a range
$[k_1, k_2]$ of a node u, id(u) = i, we construct events $(k_1, i)$ and $(k_2 + 1, i)$. Then $B_k$ can be computed
from $B_{k-1}$ by flipping all bits i for which (k, i) is an event, with $B_{-1}$ defined as a null vector. The
number of events is linear, so in $O(n)$ time we can construct all vectors $B_k$ and equip them with
the data structure for rank and select queries.

Now, answering queries is simple: for (1) we find the explicit locus of $BF_k(i)$ using Lemma 18,
and then determine its identifier using a $\mathrm{rank}_{B_k}$ query. For (2) we use a $\mathrm{select}_{B_k}$ query to get an
explicit locus u. Note that the corresponding basic factor occurs at position i if and only if the
suffix v[i, n] has its locus in the subtree rooted in u. Thus, it suffices to visit all terminal nodes
in the subtree rooted in u. We use the pointers to leftmost and rightmost terminal node and the
linked list of terminal nodes to visit them with $O(1)$-time delay. Finally for (3) it suffices to note
that $m_k$ is the number of k-basic nodes, which is the total number of 1-bits in $B_k$. This concludes
the implementation of CompactDBF and gives the announced result.

Lemma 12. For a word v of length n there exists a CompactDBF $\mathcal{D}(v)$ which takes $O(n)$ space,
can be constructed in $O(n)$ time and can answer (1) and (3) queries in $O(1)$ time, and (2) queries
with $O(1)$ time delay per item reported.

D.1.1 RandomizedDBF 

Lemma 13. For a word v of length n there exists a RandomizedDBF $\mathcal{D}^*(v)$ which takes $O(n)$
space, can be constructed in $O(n)$ expected time and can answer (1) and (3) queries in $O(1)$ time,
and (2) queries with $O(1)$ time delay per item reported.

Proof. To obtain random identifiers, it suffices to randomly shuffle identifiers id in the previous
construction (in particular the bit vectors $B_k$ are indexed using these random identifiers). Then for
each k the identifiers of k-basic nodes also form a random order, since a (uniformly) random order
of a set induces a uniformly random order of any subset. □

D.2 Slider Function 

Function Slider 

Input: Positive integers d < m and a set A of pairs (q, p) with $q \in \mathbb{Z}$ and $p \in [1, m]$.

Output: A piecewise constant representation of $G : [1, m - d] \to A$ defined as follows: G(i) is
the lexicographically smallest pair $(q, p) \in A$ among pairs with $p \in [i, i + d]$, or $\bot$ if no such pair
exists.

Lemma 16. Slider can be implemented in $O(|A|)$ time, provided that pairs in A are sorted by p
in the input.

Before we proceed with the proof let us state a folklore result. 

Fact 10. A simple queue can be augmented so that it can return its minimal element (settling ties
arbitrarily) and all operations enqueue, dequeue and find-min on the queue work in $O(1)$ amortized
time.

Proof of Lemma 16. We traverse all values $i \in [1, m - d]$ in increasing order, maintaining $Q_i =
\{(q, p) \in A : p \in [i, i + d]\}$ stored in the augmented queue of Fact 10, with minima computed on
pairs lexicographically.

Note that $G(i) = \min Q_i$; moreover $Q_i$ can be obtained from $Q_{i-1}$ by enqueueing all pairs with
their second coordinate equal to i + d and dequeueing all pairs with their second coordinate equal
to i − 1. We can store these operations as events and sort the events (merging the lists of events of
both kinds), so that we perform any work only for i with some events associated. For every such i
we evaluate G(i) using the find-min query, and if the value is different than previously, we start a
new interval in the piecewise constant representation of G. □
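The sliding-window computation can be sketched with a monotone deque, one common way to realize a queue with find-min. Unlike the proof, the sketch visits every $i \in [1, m - d]$, so it runs in $O(m + |A|)$ rather than $O(|A|)$ time; the function name is hypothetical, and `None` stands for $\bot$.

```python
from collections import deque


def slider(d: int, m: int, pairs):
    """Piecewise constant representation of G as a list of (lo, hi, value).

    pairs: list of (q, p) sorted by p; G(i) is the lexicographically smallest
    pair with p in [i, i + d], or None if there is no such pair.
    """
    window = deque()   # candidate minima, increasing from front to back
    pieces = []
    nxt = 0
    for i in range(1, m - d + 1):
        # enqueue pairs whose second coordinate has entered [i, i + d]
        while nxt < len(pairs) and pairs[nxt][1] <= i + d:
            while window and window[-1] >= pairs[nxt]:
                window.pop()           # dominated: leaves earlier and is not smaller
            window.append(pairs[nxt])
            nxt += 1
        # dequeue pairs whose second coordinate has left the window
        while window and window[0][1] < i:
            window.popleft()
        value = window[0] if window else None
        if pieces and pieces[-1][2] == value:
            pieces[-1] = (pieces[-1][0], i, value)   # extend the current interval
        else:
            pieces.append((i, i, value))             # start a new interval
    return pieces
```

For d = 1, m = 5 and the pairs (5, 1), (3, 2), (4, 4), the function G equals (3, 2) on [1, 2] and (4, 4) on [3, 4].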